ABSTRACT


In  processing  distributed  relational  database  queries, 
the  cost  of  communication  between  sites  is  the  dominant  cost 
factor.  It  is  generally  agreed  that  the  amount  of  data 
transferred  determines  this  communication  cost  to  a  large 
extent.  Thus  it  is  desirable  to  minimize  the  amount  of 
transmitted  data. 

The semi-join is a relational database operator which can
be utilized to reduce the amount of data transmission in
processing distributed queries. A class of queries, called
tree queries, can always be answered using only semi-joins.
This thesis is concerned with two related problems:
determining tree query membership of a distributed query,
and finding the optimum sequence of semi-joins to answer a
tree query.

Distributed database queries involving join clauses
with relational operators {>, ≥, =, ≠, <, ≤} are considered. A
canonical representation of a distributed query, called the
reduced join graph, is introduced. The query represented by
a reduced join graph does not contain any redundant join
clauses, and is equivalent to the original query. A
conceptually simple and efficient algorithm is presented
which determines whether or not such a query is a tree
query. For any tree query, the algorithm produces an
equivalent query and its tree query graph so that the
sequences of semi-joins to answer the original query can
immediately be obtained. An implementation of the algorithm
is outlined and its time and space requirements are
established.


Distributed query optimization for tree queries is the
second problem studied in this thesis. A class of tree
queries with equi-join clauses is considered. In general,
there are numerous semi-join sequences, called strategies,
to evaluate a tree query, and these strategies may differ
significantly in the amount of required data transmission.

In order to find the optimum strategy, a set of properties
is first given to eliminate the strategies that can never be
the optimum; then a dynamic programming approach is used to
find the optimum strategy among the potentially optimum
ones.
CHAPTER 1


INTRODUCTION 

Distributed  database  management  systems  (DBMSs)  allow  a 
collection  of  data  to  be  stored  at  multiple  locations  and 
accessed  as  a  single  unified  database.  In  applications 
requiring  access  to  an  integrated  database  from 
geographically  dispersed  locations,  a  distributed  DBMS  has 
the  following  major  advantages  over  a  conventional 
centralized DBMS [19]: (1) it is more reliable since it is
supported by multiple computers at multiple locations, (2)
it provides faster access by storing data at locations where
it is frequently used, and (3) the capacity can be adjusted
more easily to changing needs. These advantages and the
growing field of applications of distributed databases have
contributed to the need for the development of general
purpose distributed DBMSs. An overview of the technical
problems associated with the development of a general
purpose distributed DBMS and a survey of the research and
development efforts to overcome these problems can be found
in [19].
An important problem to be overcome in developing a
general purpose distributed DBMS is to find efficient
strategies for query processing. Since a distributed DBMS
allows its users to access geographically dispersed data as
a single unified database, user queries may reference data
located at multiple sites of the distributed database.
Processing such a query requires the transmission of data
between different sites via the communication network.
The data transmission and the communications between sites
are the primary sources of delay in query processing in a
distributed database environment. In contrast, the primary
sources of delay in a centralized database environment are
secondary storage accesses and CPU time. Hence query
processing strategies for centralized DBMSs work poorly in
processing distributed queries. This thesis is concerned
with the problem of designing efficient strategies for
processing distributed database queries.

1.1 Relational Databases

In this thesis, a relational database [6,8,9] with
relations distributed over different sites is assumed.
Below, the terminology associated with the relational
database model is briefly reviewed.

A relational database is a collection of two
dimensional tables called relations (e.g. Fig. 1.1). The
columns of each table are labeled by a set of distinct
attributes. The values of the entries in a column of a
relation are drawn from a set of values called the domain of
the attribute associated with that column. Each row of a
relation is called a tuple. Hence a relation can also be
viewed as a set of tuples, each tuple having the same set of
attributes. A relation R with a set of attributes
{a_1,...,a_n} is denoted by R(a_1,...,a_n). An attribute a of
a relation R is denoted R.a. The cardinality of a relation R,
denoted |R|, is the number of tuples in R.

SUPP:
    S#  | Sname | City
    ----+-------+---------
    S1  | Bill  | Chicago
    S2  | Joe   | Edm.
    S3  | John  | Edm.
    S4  | Doe   | Chicago

ORD:
    J#  | S#  | P#  | Qty
    ----+-----+-----+-----
    J1  | S2  | P1  | 500
    J2  | S2  | P2  | 200
    J2  | S1  | P1  | 150
    J3  | S1  | P1  | 200

PROJ:
    J#  | City
    ----+----------
    J1  | Chicago
    J2  | Edmonton
    J3  | Ankara

Figure 1.1. An Example of a Relational Database

The relational query language used in this thesis is
relational algebra [7]. The relational algebra is a
collection of operations that deal with relations, yielding
new relations as a result. The relational algebra operations
used in this thesis are projection, selection, join and
semi-join.

Projection: Given a relation R on a set of attributes X, the
projection of R over a set of attributes T={t_1,...,t_k}, T⊆X,
selects the columns of R labelled by the elements of T and
then eliminates the duplicate rows. It is denoted by R[T] or
R[t_1,...,t_k] (e.g. Fig. 1.2(a)).

Selection (also called theta-selection or restriction): The
selection of a relation R selects rows from R that satisfy a
given qualification q, and is denoted by σ_q(R) (e.g. Fig.
1.2(b)). The qualification q is a conjunction of selection
clauses of the form (R.a θ constant) where a is an attribute
of R, θ ∈ {>, ≥, =, ≠, <, ≤} and is applicable to the constant
and the domain of a.

SUPP[City]:
    City
    --------
    Chicago
    Edm.

(a) Projection

σ_q(STOCK):
    S#  | P#  | Qoh
    ----+-----+-----
    S1  | P1  | 400
    S2  | P3  | 100

    q = (STOCK.Qoh < 500)

(b) Selection

STOCK join_q ORD:
    J#  | S#  | P#  | Qty | Qoh
    ----+-----+-----+-----+-----
    J2  | S2  | P2  | 200 | 500
    J2  | S1  | P1  | 150 | 400
    J3  | S1  | P1  | 200 | 400

STOCK semi-join_q ORD:
    S#  | P#  | Qoh
    ----+-----+-----
    S2  | P2  | 500
    S1  | P1  | 400

    q = (ORD.S# = STOCK.S#) .and. (ORD.P# = STOCK.P#)
        .and. (ORD.Qty < STOCK.Qoh)

(c) Join and Semi-join

Figure 1.2.


Join: The join of two relations R_i and R_j, denoted
R_i join_q R_j, is obtained by concatenating the tuples of R_i
and the tuples of R_j for which the join qualification q is
true (e.g. Fig. 1.2(c)). The join qualification q is a
conjunction of join clauses of the form (R_i.a θ R_j.b) where
a and b are attributes of R_i and R_j respectively and are
defined on a common domain. The symbol θ in the join clause
is one of the relational operators {>, ≥, =, ≠, <, ≤} and is
applicable to the common domain of the attributes a and b.
The join of R_i and R_j on qualification q is called an
equi-join if all the relational operators in q are equality,
i.e. q is a conjunction of equi-join clauses. Since an
equi-join yields a result which necessarily contains two sets
of identical columns, the redundant columns are eliminated
after the equi-join. Similarly, the join of R_i and R_j on q
contains two identical columns for each equi-join clause in
q. The notation R_i join_q R_j denotes the result of the join
after the redundant columns corresponding to equi-join
clauses in q are eliminated.

Semi-join: The semi-join of a relation R_i by a relation R_j
on a join qualification q, denoted R_i semi-join_q R_j, is the
projection of R_i join_q R_j over the attributes of R_i, i.e.
R_i semi-join_q R_j = (R_i join_q R_j)[attributes of R_i].
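
The four operations defined above can be sketched in a few lines of Python. This is only an illustration, not an implementation taken from the thesis: relations are modelled as lists of dictionaries, the function names are hypothetical, and the STOCK relation contains only the tuples that are visible in Fig. 1.2 (its full contents are an assumption).

```python
def project(rel, attrs):
    """R[T]: keep the columns named in attrs, then drop duplicate rows."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def select(rel, q):
    """Selection: keep the rows of rel that satisfy the predicate q(row)."""
    return [row for row in rel if q(row)]


def join(r_i, r_j, q):
    """Join: concatenate every pair of tuples for which q(r, s) holds.

    Assumption: attributes that share a name in both relations are only
    compared for equality in q, so merging the two dicts removes exactly
    the redundant equi-join columns.
    """
    return [{**s, **r} for r in r_i for s in r_j if q(r, s)]


def semi_join(r_i, r_j, q):
    """Semi-join of r_i by r_j: (r_i join r_j)[attributes of r_i]."""
    attrs = list(r_i[0]) if r_i else []
    return project(join(r_i, r_j, q), attrs)


# Illustrative data (ORD as in Fig. 1.1, STOCK assumed from Fig. 1.2).
ORD = [
    {"J#": "J1", "S#": "S2", "P#": "P1", "Qty": 500},
    {"J#": "J2", "S#": "S2", "P#": "P2", "Qty": 200},
    {"J#": "J2", "S#": "S1", "P#": "P1", "Qty": 150},
    {"J#": "J3", "S#": "S1", "P#": "P1", "Qty": 200},
]
STOCK = [
    {"S#": "S1", "P#": "P1", "Qoh": 400},
    {"S#": "S2", "P#": "P3", "Qoh": 100},
    {"S#": "S2", "P#": "P2", "Qoh": 500},
]


def q(stock, order):
    """Join qualification of Fig. 1.2(c)."""
    return (order["S#"] == stock["S#"]
            and order["P#"] == stock["P#"]
            and order["Qty"] < stock["Qoh"])


joined = join(STOCK, ORD, q)
reduced_stock = semi_join(STOCK, ORD, q)
low_stock = select(STOCK, lambda row: row["Qoh"] < 500)
projects = project(ORD, ["J#"])
```

With this data the semi-join keeps only the STOCK tuples that take part in the join, which is the property exploited in the rest of the thesis.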
A query is a function applied to relations to retrieve
information from the database [24]. In this thesis, we
consider queries which consist of a qualification and a
target list.

Example 1.1: Consider the database given in Fig. 1.1 and the
following query statement:

"List the names of suppliers who have orders from
the projects that are located in the same city as
the suppliers and the quantity ordered is greater
than 500."

The qualification of this query can be expressed by the
following conjunction of selection and join clauses:

q = (SUPP.S# = ORD.S#) .and. (ORD.J# = PROJ.J#)
    .and. (SUPP.City = PROJ.City) .and. (ORD.Qty > '500')

The target list of the query consists of one attribute,
namely, SUPP.Sname.

Using high level query languages, users formulate
queries in terms of the content of the information to be
retrieved without reference to system oriented complexities.
Since such queries describe which information is to be
retrieved rather than how it is to be retrieved, it is the
DBMS's task to find efficient query processing strategies.

A number of algorithms have been proposed to optimize
the query processing or to achieve a satisfactory degree of
efficiency in processing relational queries in a centralized
database [18, 21, 27, 28]. The algorithms in [21, 27, 28]
are based on heuristics (e.g. elimination of common
subexpressions, etc.) that transform a query into an
equivalent one which is easier to evaluate but not
necessarily the optimal one over all possible evaluations of
the query. In [28], Yao analyses the alternative ways to
evaluate queries involving selection, join and projection.

Optimization  of  centralized  relational  database  queries 
involves  two  major  difficulties.  First,  there  is  an 
exponential  number  of  different  formulations  of  a  given 
query  among  which  the  one  that  can  be  evaluated  in  minimum 
time  is  to  be  chosen.  Second,  it  is  difficult  to  define  the 
cost  function  in  terms  of  complex  properties  of  the  storage 
structures  and  the  access  mechanisms  in  a  centralized 
database  system.  A  general  treatment  of  query  optimization 
in  a  relational  database  can  be  found  in  chapter  6  of  [24]. 

Though  the  research  on  optimizing  relational  queries  in 
a  centralized  database  is  an  ongoing  one,  the  query 
processing  strategies  for  centralized  DBMS  do  not  extend 
well  to  distributed  DBMS  due  to  the  differences  in  cost 
criterion,  as  discussed  in  the  following  section. 


1.2 Distributed Query Processing


As an example of a distributed database query, consider
the query given in Example 1.1, and assume that the
relations referenced by the query reside at different sites.
Processing such a query involves transmission of relations
between sites as well as operations such as the join of two
relations residing at the same site, projection and
selection, which do not require any data transmission. Query
processing which does not require any data transmission is
called local processing, and all the inter-site data
transmissions are called communications. In distributed
query processing, it is common to assume that the cost of
communications dominates the cost of local processing [3,
11, 13, 19, 26]. Hence, minimizing the amount of data to be
transmitted in processing a distributed query is of prime
importance [19]. That is, if each relation is stored
entirely at one site then the costs of selection and
projection operations are negligible since they are unary
operations. If the two relations to be joined reside at
different sites then such a join is the most costly
operation since one of the relations is to be transmitted to
the site of the other relation. On the other hand, a
semi-join involving two relations at different sites can be
computed by transmitting only the projection of one of the
relations over its common joining attributes to the site of
the other relation. Since a semi-join (e.g. R_i semi-join_q
R_j) can be computed with much less data transmission than
the corresponding join (e.g. R_i join_q R_j, which is computed
by transmitting R_j to the site of R_i), semi-joins can be
effectively utilized in distributed query processing to
reduce the amount of data transmission. In [13, 26]
semi-joins are used to reduce the cost of subsequent join
operations between relations. In this thesis, semi-joins
are further investigated for processing distributed database
queries efficiently. Below is a brief review of the previous
research on distributed database query processing.

A considerable amount of research has been done on
minimizing or reducing the amount of data transmission in
distributed query processing [3, 4, 11, 13, 26]. For query
processing in the SDD-1 system [20], Wong has proposed an
algorithm [26] which obtains a locally optimum solution.
Wong's algorithm can be summarized in three steps:

(1)  Perform  as  much  local  processing  of  the  query  as 
possible  before  any  data  transmission, 

(2)  Find  the  initial  feasible  strategy,  i.e.,  the 
strategy  with  minimum  cost  among  all  strategies  which  move 
all  the  relations  referenced  by  the  query  to  one  of  the 
sites  in  parallel  without  any  intervening  local  processing. 

(3)  Refine  the  initial  feasible  strategy  successively  by 
local  hill  climbing  techniques. 

An improved version of Wong's algorithm using branch and
bound techniques is given in [11].

Hevner and Yao [13] studied the problem under the
independence assumption of domains in a relational
distributed database. Two cost criteria are used: "total
time", the cost of all data transmissions taken together,
and "response time", the cost when parallel transmissions
are taken into account. Their approach results in an
optimum strategy (for both cost criteria) for
simple queries+, involving single attribute relations. For
more general queries, their algorithm [13] serves as a
heuristic.

More recently, Bernstein and Chiu [3] classified
queries into two disjoint classes: tree queries and cyclic
queries. They showed that tree queries can be evaluated with
a small number of semi-joins whilst evaluating cyclic
queries involves more elaborate data transfer.

Since tree queries can be answered by semi-joins, it is
important to be able to determine whether a given query is a
tree query or not. An algorithm for this task was presented
in [3]. It has been generalized to allow more than one
common joining attribute between any two relations [4, 29].
The queries considered in [3, 4, 29] involve equi-join
clauses only.

+  A  simple  query  is  a  conjunction  of  equi-join  clauses  such 
that  after  the  local  processing  each  relation  has  a  single 
attribute,  and  that  attribute  is  common  to  all  the  relations 
referenced  by  the  query. 
1.3 Overview and Outline of the Thesis

In this thesis, the problem of determining whether a
given distributed query is a tree query or not is studied by
considering more general queries. An efficient tree query
membership algorithm is presented for queries whose
qualifications are conjunctions of join and selection
clauses with relational operators {>, ≥, =, ≠, <, ≤}.
Moreover, there is no restriction on the number of
attributes that a relation involved in the query may have.

In general, there are numerous semi-join sequences that
can be used to evaluate a given tree query, and these
semi-join sequences may differ significantly in the amount
of required data transmission. Finding the optimum sequence
of semi-joins to evaluate tree queries is an important
special case of the general distributed query optimization
problem for which there is no known optimum solution.
Distributed query optimization for tree queries is the
second problem studied in this thesis. An efficient dynamic
programming algorithm that finds the optimum sequence of
semi-joins for a class of tree queries is presented. For
some tree queries the optimum sequence of semi-joins is also
the optimum procedure to answer the query. As an example
consider a tree query such that fully reducing a relation R
referenced by the query is sufficient to answer the query
and R resides at the site where the answer of the query is
required. Clearly, the optimum sequence of semi-joins which
fully reduces R is also the optimum way to answer such a
query. However, the optimum sequence of semi-joins may not
be optimum among all possible procedures (utilizing joins as
well as semi-joins) to answer a distributed tree query in
general. Nevertheless, the optimum sequence of semi-joins
for a tree query is likely to be an efficient procedure to
answer the query even when it is not optimum.


Chapter 2 gives the motivations and the definitions
that will be utilized in the remainder of the thesis. The
distinction between tree and cyclic queries and the
importance of semi-joins are discussed.


In Chapter 3, a canonical representation of a
distributed relational database query, called the reduced
join graph, is introduced. The query represented by a reduced
join graph does not contain any redundant join clauses, and
is equivalent to the original query. The main result of
Chapter 3 is a conceptually simple and efficient algorithm
for determining the tree query membership of a distributed
query, which may involve join and selection clauses with
relational operators {>, ≥, =, ≠, <, ≤}. For tree queries,
the presented algorithm produces an equivalent query for
which a sequence of semi-joins to evaluate the query can
immediately be constructed. An efficient implementation of
the algorithm is also given together with its time and space
complexities.


Chapter 4 studies the distributed query optimization
for a class of tree queries. The tree queries considered in
this chapter are conjunctions of equi-join clauses such that
no two relations referenced by the query have more than one
join attribute in common. For such a tree query, there is an
exponential number of semi-join sequences, each of which can
be used to evaluate the query. A set of properties is
presented that must be satisfied by a potentially optimum
strategy. These properties help to discard the strategies
that can never be the optimum, so that the number of
strategies to be considered in finding the optimum is
reduced considerably. A recursive algorithm is then
presented to find the optimum sequence of semi-joins. Some
practical considerations of size estimations and the related
difficulties are also discussed. Usually strong assumptions
are required to estimate the sizes of intermediate relations
which are produced while evaluating a query. However, the
optimization results of Chapter 4 are not based on the
method used for size estimation and are applicable to any
reasonable size estimation method.

Finally, Chapter 5 summarizes the significance of the
results obtained in this thesis and discusses possible
avenues of further research.


CHAPTER 2


SEMI-JOINS AND TREE QUERIES

2.1 Definition and Background

In a distributed database, data is usually dispersed
over the sites of the database in a redundant manner so as
to improve the reliability and the responsiveness of the
overall system [19]. However, the problem of maintaining and
managing redundant copies of the data is considered to be a
different problem from that of retrieving distributed data
from the database [26] and is not dealt with in this thesis.
Hence, throughout this thesis, a relational distributed
nonredundant database is assumed. Furthermore, each relation
is assumed to reside at a single site, i.e. the relations
are not fragmented among several sites, and the network is
assumed to be fully connected. These two assumptions are
also common to [3, 4, 11, 13, 26].

The local processing costs are assumed to be negligible
in comparison to communications cost, and the communications
cost is assumed to be determined by the amount of data
transmission. The aim is to minimize the amount of data to
be transmitted between sites in processing a distributed
query.
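
Under this cost model, the saving offered by a semi-join can be illustrated with a small sketch. The unit of cost used here (one unit per attribute value shipped) and the function names are assumptions made for the illustration only; the thesis itself leaves the size estimation method open.

```python
def values_shipped_for_join(r_j):
    """Shipping all of r_j to the other site: every value of every tuple."""
    return sum(len(row) for row in r_j)


def values_shipped_for_semi_join(r_j, join_attrs):
    """Shipping only the duplicate-free projection of r_j over join_attrs."""
    distinct = {tuple(row[a] for a in join_attrs) for row in r_j}
    return len(distinct) * len(join_attrs)


# ORD as in Fig. 1.1; the join attributes S# and P# are chosen for
# illustration (a pure equi-join qualification is assumed).
ORD = [
    {"J#": "J1", "S#": "S2", "P#": "P1", "Qty": 500},
    {"J#": "J2", "S#": "S2", "P#": "P2", "Qty": 200},
    {"J#": "J2", "S#": "S1", "P#": "P1", "Qty": 150},
    {"J#": "J3", "S#": "S1", "P#": "P1", "Qty": 200},
]

join_cost = values_shipped_for_join(ORD)                      # 4 tuples x 4 attributes
semi_join_cost = values_shipped_for_semi_join(ORD, ["S#", "P#"])  # 3 distinct pairs x 2
```

Even on this tiny relation the projection is less than half the size of the full relation; the gap widens as relations gain attributes or duplicate join values.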


A query Q is assumed to be a conjunction of join
clauses of the form (R_i.a θ R_j.b) where a and b are the
attributes of the relations R_i and R_j respectively and
θ ∈ {>, ≥, =, ≠, <, ≤} unless it is stated otherwise. The
selection clauses (e.g. (R_i.a θ constant)) are omitted since
such clauses can be handled by local processing. The
exclusion of disjunctions is also justifiable (for the tree
query membership problem) since disjunctions can be handled
by transforming the qualification of the query into
disjunctive normal form, and treating each conjunction as
the qualification of a separate query. The target lists are
omitted for notational convenience; however, the analysis
will later be extended to allow queries with target lists.
Other query constructs such as quantifiers, tuple variables
and aggregate functions will not be addressed in this
thesis. For notational convenience, each relation referenced
by the query is assumed to reside at a different site. Since
the objective is to minimize the amount of intersite data
transmission, the analysis in this thesis is restricted to
the class of queries which is not subject to local
processing.

2.2 Representation of Queries, Join Graphs

A query Q is represented by a join graph, JG_Q, which is
a directed graph whose vertices are the join attributes
involved in Q and whose arcs are the join clauses of Q.
Since a<b and a≤b are the same as b>a and b≥a respectively,
it is sufficient to consider the join clauses of the form
(R_i.a θ R_j.b) in Q, where θ ∈ {>, ≥, =, ≠}. A join clause
(R_i.a θ R_j.b), θ ∈ {>, ≥, ≠}, in Q is represented by an arc
from R_i.a to R_j.b in JG_Q (the operator = forms a special
case). To distinguish between join clauses with different
relational operators, three different types of arcs are used
in JG_Q: type-1 arcs representing "≥" join clauses, type-2
arcs representing ">" join clauses, and type-3 arcs
representing "≠" join clauses. A type-t arc, 1≤t≤3, in JG_Q
from R_i.a to R_j.b is denoted by (R_i.a, R_j.b)_t. Note that
unlike type-1 or type-2 arcs a type-3 arc does not have
direction, i.e., (R_i.a, R_j.b)_3 is the same as
(R_j.b, R_i.a)_3, and either one of them can be used to
represent the join clause (R_i.a ≠ R_j.b). Since (a=b) is the
same as (a≥b) .and. (b≥a), an equi-join clause
(R_i.a = R_j.b) in Q is represented by the two type-1 arcs,
(R_i.a, R_j.b)_1 and (R_j.b, R_i.a)_1, in JG_Q. More precisely,
the join graph JG_Q(V,A) of a query Q is a labelled directed
graph where

V = {R_i.a | a is a join attribute of R_i in Q}

A = A_type-1 ∪ A_type-2 ∪ A_type-3

A_type-1 = {(R_i.a, R_j.b)_1 | (R_i.a ≥ R_j.b) is in Q} ∪
           {(R_i.a, R_j.b)_1, (R_j.b, R_i.a)_1 | (R_i.a = R_j.b)
            is in Q}

A_type-2 = {(R_i.a, R_j.b)_2 | (R_i.a > R_j.b) is in Q}

A_type-3 = {(R_i.a, R_j.b)_3 | (R_i.a ≠ R_j.b) is in Q}.

Schematically, a type-1 arc, (u,v)_1, is shown as u --> v, a
type-2 arc, (u,v)_2, is shown as u -->> v, and a type-3
arc, (u,v)_3, is shown as u -//- v. An example of a query
and the corresponding join graph is given in Fig. 2.1. For
every join graph there is a unique query corresponding to it
if the ordering of clauses in the query is not taken into
consideration. Thus there is a one-to-one correspondence
between queries and join graphs.
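
The construction of a join graph from a list of join clauses can be sketched as follows. This is a minimal illustration under stated assumptions: clauses are given as (left attribute, operator, right attribute) triples with operators spelled '>', '>=', '=', '!=', '<', '<=', arcs are stored as (tail, head, type) triples, and the function name is hypothetical.

```python
def build_join_graph(clauses):
    """Return (vertices, arcs) for a conjunction of join clauses."""
    vertices, arcs = set(), set()
    for left, op, right in clauses:
        # a < b and a <= b are rewritten as b > a and b >= a.
        if op == "<":
            left, op, right = right, ">", left
        elif op == "<=":
            left, op, right = right, ">=", left
        vertices.update((left, right))
        if op == ">=":
            arcs.add((left, right, 1))        # type-1 arc
        elif op == ">":
            arcs.add((left, right, 2))        # type-2 arc
        elif op == "!=":
            arcs.add((left, right, 3))        # type-3 arc (undirected in meaning)
        elif op == "=":
            arcs.add((left, right, 1))        # a = b is (a >= b) and (b >= a)
            arcs.add((right, left, 1))
        else:
            raise ValueError(f"unknown operator: {op!r}")
    return vertices, arcs


# The query Q1 of Fig. 2.1, written with the spelling assumed above.
Q1 = [
    ("SUPP.S#", "=", "ORD.S#"),
    ("ORD.S#", "=", "STOCK.S#"),
    ("ORD.P#", "!=", "STOCK.P#"),
    ("ORD.Qty", ">", "STOCK.Qoh"),
]
V, A = build_join_graph(Q1)
```

For Q1 this yields seven vertices and six arcs: four type-1 arcs for the two equi-join clauses, one type-3 arc and one type-2 arc.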

A join graph JG_Q corresponding to a query Q may contain
a number of connected components, each being a directed
graph with at least two vertices and containing type-1,
type-2 and/or type-3 arcs. It is possible that there may be
two or more different types of arcs from u to v in JG_Q,
where u and v are any two distinct vertices. In such cases,
all of the arcs from u to v can be replaced by a single arc,
namely (u,v)_2, as follows. Suppose both arcs (u,v)_1 and
(u,v)_2 are in JG_Q. This occurs when Q contains both clauses
(u≥v) and (u>v). Since Q is a conjunction of clauses, (u≥v)
.and. (u>v) is equivalent to (u>v). Thus, if JG_Q contains
both a type-1 arc (u,v) and a type-2 arc (u,v), the type-1
arc (u,v) is redundant and can be eliminated. Similarly, if
JG_Q contains both arcs (u,v)_2 and (u,v)_3 then (u,v)_3 is
redundant and can be eliminated. On the other hand, if JG_Q
contains both arcs (u,v)_1 and (u,v)_3 then (u,v)_1 and
(u,v)_3 can be replaced by the arc (u,v)_2 since (u≥v) .and.
(u≠v) is equivalent to (u>v). In general, a type-2 arc
(u,v)_2 is said to dominate type-1 or type-3 arcs from u to
v, and if there are two or more different arcs in JG_Q from u
to v then all of the arcs from u to v are replaced by the
dominating arc, (u,v)_2. Thus, with no loss of generality,
for any two distinct vertices, u,v, in a join graph, JG_Q,
there is at most one arc from u to v.

Q1: (SUPP.S# = ORD.S#) .and. (ORD.S# = STOCK.S#)
    .and. (ORD.P# ≠ STOCK.P#) .and. (ORD.Qty > STOCK.Qoh)

(a) A Query Q1

(b) Join Graph of Q1 [drawing not reproduced; vertices:
    SUPP.S#, ORD.S#, STOCK.S#, ORD.P#, STOCK.P#, ORD.Qty,
    STOCK.Qoh]

(c) Query Graph of Q1 [drawing not reproduced; vertices:
    SUPP, ORD, STOCK]

(d) Transitive closure of JG_Q1 [drawing not reproduced; same
    vertices as in (b)]

Figure 2.1. A Query, its Query Graph and Join Graph.
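
The dominance rule can be sketched as a small reduction over arcs stored as (tail, head, type) triples. The representation and the function name are assumptions for illustration. Because a type-3 arc has no direction, the sketch lets a type-3 arc between u and v combine with a type-1 or type-2 arc in either orientation; this is an interpretation of the text rather than something it states explicitly.

```python
def reduce_parallel_arcs(arcs):
    """Leave at most one arc from u to v, applying the dominance rule."""
    directed = {}        # (u, v) -> set of arc types taken from {1, 2}
    undirected = set()   # frozenset({u, v}) for each type-3 arc
    for u, v, arc_type in arcs:
        if arc_type == 3:
            undirected.add(frozenset((u, v)))
        else:
            directed.setdefault((u, v), set()).add(arc_type)

    reduced = set()
    for (u, v), types in directed.items():
        has_neq = frozenset((u, v)) in undirected
        if 2 in types or has_neq:
            # (u>=v) and (u>v) is (u>v); (u>=v) and (u!=v) is (u>v).
            reduced.add((u, v, 2))
        else:
            reduced.add((u, v, 1))

    # A type-3 arc survives only if no type-1/type-2 arc joins its ends.
    covered = {frozenset(pair) for pair in directed}
    for pair in undirected - covered:
        u, v = sorted(pair)
        reduced.add((u, v, 3))
    return reduced
```

For example, the arcs (a,b)_1 and (a,b)_3 reduce to the single arc (a,b)_2, while a lone type-3 arc is left unchanged.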

A path in JG_Q from vertex u to vertex v is a sequence
of distinct arcs a_1,...,a_p, p ≥ 1, such that there exists a
corresponding sequence of vertices u=v_0, v_1, ..., v_p=v
satisfying a_{k+1} = (v_k, v_{k+1})_t is in JG_Q for
0 ≤ k ≤ p-1, where 1≤t≤3 for p=1, and 1≤t≤2 for p≥2. The
motivation behind the definition of a path in JG_Q is to
establish the join clauses which may or may not be in Q but
are implied by Q. Suppose JG_Q contains the arcs (u,w)_1 and
(w,v)_3, where u,v,w are three distinct vertices in JG_Q. The
conjunction of the join clauses corresponding to these arcs,
i.e., (u≥w) .and. (w≠v), does not imply any relation between
u and v. Thus, the arcs (u,w)_1 and (w,v)_3 do not form a
path from u to v. In general, no type-3 arc can be in a path
of length ≥ 2 in JG_Q. A path is called a type-1 path if all
the arcs in the path are type-1 arcs. If there are one or
more type-2 arcs in a path of length ≥ 2 then the path is
called a type-2 path. The type of a path of length 1 in JG_Q
is the type of the only arc in the path. Thus, a path of
length ≥ 2 is either type-1 or type-2.

A cycle in JG_Q is a path of length at least 2, which
begins and ends with the same vertex. Observe that a self
loop, i.e. an arc of the form (u,u) in JG_Q, is not a cycle.
Since any join clause in Q corresponding to a self loop in
JG_Q is redundant, we consider queries which do not have self
loops in their join graphs. A join graph is called acyclic
if it contains no cycles.

Since Q is a conjunction of join clauses, and ≥ is transitive, a type-1 path in JG_Q from u to v implies that the join clause (u ≥ v) must be satisfied. Similarly, by transitivity, a type-2 path in JG_Q from u to v implies that the join clause (u > v) must be satisfied. It is clear that no join clause of the form (u ≠ v) can be implied in this manner. However, a type-1 path of length ≥ 2 from u to v and the arc (u,v)_3 imply that the join clause (u > v) is to be satisfied (by transitivity and dominance). For notational convenience, we call a type-1 path, u = w_0, w_1, ..., w_n = v, with the arcs (w_{i-1}, w_i)_1, 1 ≤ i ≤ n, n ≥ 2, in JG_Q, an implied type-2 path from u to v if (w_s, w_t)_3 is in JG_Q where 0 ≤ s, t ≤ n, 2 ≤ |s-t| ≤ n.

Now consider a cycle in JG_Q incident with two vertices u and v. One path from u to v implies u ≥ v and the other from v to u implies v ≥ u. If one or more arcs in the cycle are type-2 arcs then at least one of the inequalities is a strict one; a contradiction. Thus, it is sufficient to consider those queries which have no type-2 arcs in cycles. Since u ≥ v and v ≥ u imply u = v, a cycle in JG_Q represents a conjunction of join clauses of the form (u = v) between every pair of vertices, u, v, incident with the cycle. Clearly, a type-3 arc in JG_Q between any two vertices that are incident with the same cycle is a contradiction, i.e., (u ≠ v) .and. (u = v). Thus, it is sufficient to consider those queries that have only type-1 arcs within cycles in their join graphs.
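The way path types combine can be illustrated with a small sketch. It is not the thesis's procedure: it assumes a join graph given as a Python dictionary (a hypothetical representation), ignores type-3 arcs and implied type-2 paths, and propagates over walks rather than paths of distinct arcs, which agrees with the definitions above only when no type-2 arc lies on a cycle.

```python
from itertools import product

GE, GT = 1, 2  # type-1 arc: u >= v; type-2 arc: u > v


def implied_comparisons(vertices, arcs):
    """Return {(u, v): type} for the strongest comparison implied by paths.

    `arcs` maps (u, v) to GE or GT. Type-3 (not-equal) arcs are left out
    because they cannot appear in a path of length >= 2.
    """
    best = dict(arcs)
    # Floyd-Warshall style relaxation (k is the outermost loop variable).
    # A path through k is type-2 as soon as either half contains a
    # type-2 arc, so the combined type is the maximum of the two halves.
    for k, u, v in product(vertices, repeat=3):
        if (u, k) in best and (k, v) in best:
            combined = max(best[(u, k)], best[(k, v)])
            if combined > best.get((u, v), 0):
                best[(u, v)] = combined
    return best


def is_contradictory(vertices, best):
    """A type-2 arc on a cycle would imply u > u for some vertex u."""
    return any(best.get((x, x)) == GT for x in vertices)


# (u >= w) and (w > v) imply (u > v).
closure = implied_comparisons(["u", "w", "v"],
                              {("u", "w"): GE, ("w", "v"): GT})
```

Here `closure[("u", "v")]` is `GT`, matching the statement that a type-2 path from u to v implies (u > v).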

2.3 Transitive Closure of Join Graph

Let JG_Q(V,A) be the join graph of a query Q where V is the set of vertices and A is the set of arcs. Informally, the transitive closure, JG_Q^T, of JG_Q is a join graph that represents all the join clauses that are in Q or implied by Q. Let u and v be two vertices in JG_Q. A type-1 arc (u,v)_1 is said to be implied by Q if there is a type-1 path in JG_Q from u to v such that the path does not use any arc from u to v. Similarly, a type-2 arc (u,v)_2 is implied by Q if there is a type-2 (or implied type-2) path in JG_Q from u to v such that the path does not use any arc from u to v. Moreover, a type-3 arc (u,v)_3 is said to be implied by Q if JG_Q contains an arc (u,w)_3, for some other vertex w, and a cycle incident with w and v.

Formally, the transitive closure of JG_Q is JG_Q^T(V,A^T) where A^T is obtained by successively adding all implied arcs to A and replacing multiple arcs between two vertices by dominating arcs (see Fig. 2.1 and Fig. 3.1). Observe that for any arc a = (u,v)_i, 1 ≤ i ≤ 3, in JG_Q, JG_Q^T contains exactly one arc from u to v, which either is the same as a or dominates a. JG_Q^T does not contain any self loops.


2.4 Equivalence of Queries

Two queries, Q_1 and Q_2, are equivalent if their answers* are the same irrespective of the contents of the relations [3]. Obviously, two queries having the same join graph are equivalent. The following lemma gives the necessary and sufficient condition for the equivalence of queries in terms of join graphs.


Lemma 2.1: Q_1 is equivalent to Q_2 iff JG_{Q_1}^T = JG_{Q_2}^T.

Proof: Let JG_{Q_1}^T = JG_{Q_2}^T, and let Q' denote the query represented by JG_{Q_1}^T. From the definition of transitive closure, any clause which is in Q' but not in Q_1 is implied by Q_1. But Q' contains all the join clauses in Q_1 except those which are dominated. Thus Q_1 is equivalent to Q'. Similarly, Q' is equivalent to Q_2 since JG_{Q_1}^T = JG_{Q_2}^T. Therefore Q_1 is equivalent to Q_2.

Suppose JG_{Q_1}^T ≠ JG_{Q_2}^T. Since each connected component in a join graph contains at least two vertices and is connected, JG_{Q_1}^T and JG_{Q_2}^T must differ on at least one arc. This implies the existence of a join clause satisfied by the qualification of one but not the other query. Therefore, the answers of the two queries can not be the same, irrespective of the contents of the relations, so Q_1 is not equivalent to Q_2. #

* In general, the answer of a query is the set of tuples described by the qualification and the target list of the query. Since queries under consideration do not have target lists, the answer of a query in this context refers to the set of tuples (of all the relations referenced by the query) satisfying the qualification.

2.5 Efficiency of Semi-joins

For an equi-join query Q, R_i semi-join_q R_j can be computed by transferring only the projection of R_j over its attributes that appear in q [3], where q is the join qualification over R_i and R_j. If q contains other join clauses as well as equi-join clauses then it is evident that the data transfer required to compute R_i semi-join_q R_j is also the projection of R_j over its attributes in q.

Suppose there is only one join clause between R_i and R_j in Q. Let this clause be (R_i.a > R_j.b) or (R_i.a ≥ R_j.b). Then it is easy to verify that the data transfer required to compute R_i semi-join_q R_j is only the minimum value in the column b of R_j. Similarly, in order to compute R_j semi-join_q R_i, it is sufficient to transfer only the maximum value in column a of R_i. However, if there are several attributes of R_j appearing in the join qualification q then, in order to compute R_i semi-join_q R_j, it may be necessary to transfer the projection of R_j over its attributes in q to the site of R_i. Thus, for queries involving join clauses with relational operators {>, ≥, =, ≠}, the data transmission required to perform R_i semi-join_q R_j is restricted to the projection of R_j over its attributes in q.

Clearly, a semi-join (unlike a join) never increases the size of the relation on which it is performed. That is, the number of tuples in the relation resulting after R_i semi-join_q R_j is equal to or less than the number of tuples in R_i. Consequently, semi-joins can be effectively used to reduce the amount of data transmission in processing a distributed query. Furthermore, tree queries, to be defined next, can be answered using only semi-joins [3].
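The single-clause case can be checked with a short sketch. It assumes relations are held as lists of dictionaries (a hypothetical in-memory stand-in for relations at two sites) and that the only join clause is (R_i.a > R_j.b); the function names are illustrative.

```python
def semijoin_naive(r_i, r_j):
    """R_i semi-join R_j on (R_i.a > R_j.b), using all of R_j."""
    return [t for t in r_i if any(t["a"] > s["b"] for s in r_j)]


def semijoin_via_min(r_i, r_j):
    """Same result, but only the minimum of column b leaves R_j's site."""
    if not r_j:
        return []  # nothing to compare against, so no tuple qualifies
    min_b = min(s["b"] for s in r_j)  # the single value transferred
    return [t for t in r_i if t["a"] > min_b]


r_i = [{"a": 5}, {"a": 1}, {"a": 9}]
r_j = [{"b": 4}, {"b": 7}]
reduced = semijoin_via_min(r_i, r_j)
```

Both functions return the tuples with a = 5 and a = 9, and the result is never larger than `r_i`.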

2.6 Query Graphs, Tree and Cyclic Queries

Given the join graph JG_Q of a query Q, we want to examine whether it is possible to answer Q by using semi-joins. This is facilitated by defining a query graph. The query graph QG of a given query Q is an undirected graph whose vertices are the relation names referenced in Q, and whose edges indicate the data transfer between the relations. More formally, the query graph QG(V,E) [29] of a query Q is an undirected graph where

V = set of all relation names referenced in Q,

E = {(R_i,R_j) | i ≠ j and there is an arc in JG_Q between some attribute of R_i and some attribute of R_j}.

By definition no self loop (i.e., an arc of the form (R_i,R_i)) exists in a query graph. An example of a query graph is given in Fig. 2.1.
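As a sketch of this definition, the following derives a query graph from the attribute pairs of a join graph and tests whether it is a tree. The tuple representation and function names are assumptions made for illustration. The data are the attribute pairs of query Q2 from Fig. 2.2, followed by a variant in which the clause between SUPP.S# and STOCK.S# is replaced by one between ORD.S# and STOCK.S# (implied by transitivity of equality).

```python
def query_graph(arcs):
    """arcs: iterable of ((relation, attribute), (relation, attribute))."""
    relations, edges = set(), set()
    for (ru, _), (rv, _) in arcs:
        relations.update((ru, rv))
        if ru != rv:  # no self loops in a query graph
            edges.add(frozenset((ru, rv)))
    return relations, edges


def is_tree(relations, edges):
    """Connected with |V| - 1 edges; a null graph is treated as a tree."""
    if not relations:
        return True
    if len(edges) != len(relations) - 1:
        return False
    adj = {r: set() for r in relations}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(relations))]
    while stack:  # depth-first search for connectivity
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(adj[r] - seen)
    return seen == set(relations)


q2 = [(("SUPP", "S#"), ("ORD", "S#")), (("SUPP", "S#"), ("STOCK", "S#")),
      (("ORD", "P#"), ("STOCK", "P#")), (("ORD", "Qty"), ("STOCK", "Qoh"))]
q2_equiv = [(("SUPP", "S#"), ("ORD", "S#")), (("ORD", "S#"), ("STOCK", "S#")),
            (("ORD", "P#"), ("STOCK", "P#")), (("ORD", "Qty"), ("STOCK", "Qoh"))]
```

`is_tree(*query_graph(q2))` is false because SUPP, ORD and STOCK form a triangle, while `is_tree(*query_graph(q2_equiv))` is true.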


For  the  remainder  of  the  thesis,  it  is  assumed  that 
every  query  has  a  connected  query  graph.  This  is  not  a 
restrictive  assumption  since  otherwise  the  query  is  a 
conjunction  of  subqueries  (whose  query  graphs  correspond  to 
the  connected  components  in  the  query  graph)  each  of  which 
can  be  processed  independently  [3].  Furthermore,  if  the 
query  graph  of  Q  is  connected,  then  any  query  equivalent  to 
Q  also  has  a  connected  query  graph,  as  is  shown  by  the 
following  lemma. 


Lemma  2.2;  Let  Q  be  a  query  whose  query  graph,  QG,  is 
connected.  Then  the  query  graph  of  any  query  that  is 
equivalent  to  Q  is  also  connected. 

Proof: Consider any edge (R_i,R_j) in QG. The join graph JG_Q of Q must contain at least one arc incident with some attribute of R_i and some attribute of R_j. Let this arc be (R_i.a, R_j.b)_k, 1 ≤ k ≤ 3. Let Q_1 be a query equivalent to Q, and let JG_{Q_1} and QG_1 denote its join graph and query graph respectively. From Lemma 2.1, JG_{Q_1}^T = JG_Q^T, so JG_{Q_1} contains at least one path whose end vertices are R_i.a and R_j.b. Hence, QG_1 contains a path from R_i to R_j. This shows that if any two vertices are adjacent in QG then they are connected in QG_1. Since both QG and QG_1 have the same set of vertices and QG is connected, this implies QG_1 is also connected. #

For equi-join queries, it was shown in [3] that if the query graph of Q is a tree then Q can be answered* by semi-joins. This result has a direct extension to queries involving join clauses with {>, =, ≥, ≠}. In addition, if the query graph of Q is cyclic, it may be possible to transform Q into an equivalent query such that the equivalent query has a tree query graph.

A query, Q, is called a tree query if either Q itself or a query equivalent to Q has a tree query graph. All other queries are called cyclic queries [3]. Thus a query can be one of two types: a tree query or a cyclic query. The set of all tree queries is denoted by TQ and the set of all cyclic queries by CQ. Tree and cyclic queries are illustrated in Fig. 2.2.

* A query Q without a target list is said to be answered if, for every relation referenced in Q, the set of tuples satisfying (the qualification of) Q is found. If Q has a target list then it is sufficient to find only those tuples that are described in the target list and satisfy the qualification of Q. Hence a query with a target list can also be answered by semi-joins if its query graph is a tree.


Q2: (SUPP.S# = ORD.S#) .and. (SUPP.S# = STOCK.S#) .and. (ORD.P# ≠ STOCK.P#) .and. (ORD.Qty > STOCK.Qoh)

[Join graph JG_{Q2} over the vertices SUPP.S#, ORD.S#, STOCK.S#, ORD.P#, STOCK.P#, ORD.Qty, STOCK.Qoh, and query graph QG2.]

(a) A Tree Query (Equivalent to ... in Fig. 2.2) with a Cyclic Query Graph.

Q3: (PART.City = SUPP.City) .and. (PART.P# ... ORD.P#) .and. (SUPP.S# = ORD.S#)

[Join graph JG_{Q3} over the vertices PART.City, SUPP.City, PART.P#, ORD.P#, SUPP.S#, ORD.S#, and query graph QG3.]

(b) A Cyclic Query.

Figure 2.2. Tree and Cyclic Queries.


CHAPTER  3 


DETERMINING  TREE  QUERY  MEMBERSHIP 
OF  A  DISTRIBUTED  QUERY 

3.1 Introduction

Clearly, if the query graph of a given query is a tree then the query is a tree query. However, a query with a cyclic query graph may also be a tree query (e.g. query Q2 in Fig. 2.2). Thus, a procedure is required to determine the type of a query in general. For a query Q, a procedure which enumerates all the queries equivalent to Q and checks whether any of their query graphs is a tree would be sufficient to determine the type of Q. However, such an exhaustive enumeration can be avoided by using a canonical representation of a query, called a reduced join graph.

In this chapter, first the reduced join graph is introduced, then an efficient tree query membership algorithm is given for conjunctive queries with relational operators {>, ≥, =, ≠, ≤, <}. For tree queries, the algorithm also produces an equivalent query together with its tree query graph so that the sequence of semi-joins to evaluate the query can immediately be constructed. An implementation of the algorithm and its time and space complexities are given.

3.2 Reduced Join Graph

Intuitively, the reduced join graph of a query Q is a join graph representing an equivalent query which has the fewest join clauses among all queries equivalent to Q. Such a join graph is obtained by first grouping vertices into equivalence classes (where any two vertices are in an equivalence class iff there is a cycle in the join graph incident with both vertices), and transforming the arcs within each equivalence class into a set of arcs representing only equi-join clauses; and then, from the set of arcs incident with vertices in different equivalence classes, removing all redundant arcs and replacing the dominated arcs. In somewhat more detail, the construction of the canonical join graph consists of the following two steps.

(i) The construction of equivalence classes and corresponding spanning trees: Let C be a connected component in JG_Q. As discussed previously (Section 2.2), a cycle in C represents a conjunction of equi-join clauses of the form (u = v) for every pair of vertices, u, v, incident with the cycle. Thus vertices in a connected component can be partitioned into vertex equivalence classes such that any two vertices are in the same equivalence class iff there is a cycle incident with both vertices, and a vertex forms an equivalence class by itself iff it is not incident with any cycle in C. Consequently, the set of arcs within each multimember vertex equivalence class can be replaced by a spanning tree of the vertices in the equivalence class, where each "edge" (u,v) of the spanning tree consists of both arcs (u,v)_1 and (v,u)_1. Since there is no self loop in C, it is clear that there is no arc within a single member vertex equivalence class.

(ii) The removal of redundant arcs: This proceeds by the construction of a condensed acyclic graph AC(V_c, A_c) for each connected component C in the original join graph. Each vertex equivalence class in C is condensed to a vertex in V_c. The arcs in AC are determined by the arcs in C, as follows. For any two equivalence classes E_i, E_j, i ≠ j, let S_{i,j} be the set of arcs in C from vertices in E_i to vertices in E_j, i.e.,

S_{i,j} = { (u,v)_k | (u,v)_k is an arc in C, u ∈ E_i, v ∈ E_j and 1 ≤ k ≤ 3 }.

If S_{i,j} ≠ ∅ then there is exactly one arc in AC of the form (E_i,E_j); otherwise there is no arc from E_i to E_j in AC. Suppose S_{i,j} ≠ ∅. The arc in AC from E_i to E_j is a type-1 arc if all the arcs in S_{i,j} are type-1 arcs; it is a type-3 arc if all the arcs in S_{i,j} are type-3 arcs; and it is a type-2 arc otherwise. The resulting condensed graph, AC, is acyclic since any cycle in C must be within an equivalence class, which in turn is a single vertex in AC. The arcs in the transitive closure of AC are successively examined, in any order, and those which are implied by transitivity are removed. The resulting graph, denoted AC^t, is called the transitive reduction (see Appendix A) of the acyclic graph AC. (Notice that the transitive reduction in our context is an extension of [1] for directed graphs containing three types of arcs. Appendix A formalizes the construction of the transitive reduction.) Finally, each vertex in V_c is expanded by replacing it with a spanning tree of the vertices in the equivalence class corresponding to that vertex. The connected component of a reduced join graph corresponding to the connected component C of JG_Q is obtained by expanding every vertex in AC^t in this manner.

Thus, a reduced join graph for Q is obtained by performing steps (i) and (ii) for every connected component in JG_Q (see Fig. 3.1).
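The first part of each step can be sketched as follows: vertex equivalence classes are found as sets of mutually reachable vertices, and arcs between classes are condensed with the typing rule for S_{i,j}. This is an illustration under assumptions, not the thesis's implementation: the join graph is a dictionary mapping (u, v) to an arc type, reachability uses only type-1 and type-2 arcs (type-3 arcs cannot lie on a cycle), and the transitive reduction and spanning-tree expansion are omitted. All names are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(vertices, arcs):
    """Map each vertex to the frozenset of vertices sharing a cycle with it."""
    succ = defaultdict(list)
    for (u, v), t in arcs.items():
        if t in (1, 2):  # only these arc types can form paths of length >= 2
            succ[u].append(v)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in succ[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    reach = {u: reachable(u) for u in vertices}
    return {u: frozenset(v for v in vertices
                         if v in reach[u] and u in reach[v])
            for u in vertices}


def condense(class_of, arcs):
    """Return {(E_i, E_j): type} for the condensed graph AC."""
    types = defaultdict(set)  # plays the role of S_{i,j}, keeping arc types
    for (u, v), t in arcs.items():
        if class_of[u] != class_of[v]:
            types[(class_of[u], class_of[v])].add(t)
    return {pair: (1 if ts == {1} else 3 if ts == {3} else 2)
            for pair, ts in types.items()}


# Hypothetical join graph: a = b (a cycle of type-1 arcs), a >= c, b != c, c > d.
arcs = {("a", "b"): 1, ("b", "a"): 1, ("a", "c"): 1,
        ("b", "c"): 3, ("c", "d"): 2}
class_of = equivalence_classes(["a", "b", "c", "d"], arcs)
ac = condense(class_of, arcs)
```

In this example the arcs from the class {a, b} to the class {c} have mixed types 1 and 3, so the condensed arc is type-2, as the rule above prescribes.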

Since the construction of the reduced join graph is based on the transitive reduction of an acyclic graph, we will utilize the properties of the transitive reduction in the following analysis. Those properties are stated in Lemma 3.1 below.


[Figure 3.1. Construction of Reduced Join Graph. Panels: (a) A Query Q; (b) A Join Graph JG_Q; (c) Vertex Equivalence Classes; (d) Acyclic Graph of JG_Q; (e) Transitive Reduction of the Acyclic Graph in (d); (g) Transitive Closure of the Join Graph of Q.]

Lemma 3.1: The transitive reduction, G^t, of a finite acyclic join graph G satisfies

(i) (G^t)^T = G^T, and

(ii) if H^T = G^T then for any arc (u,v)_i in G^t there is an arc (u,v)_j in H, 1 ≤ i, j ≤ 3.

Proof: Directly follows from Proposition A.1 in Appendix A. #

Thus, for a given acyclic join graph, G, any join graph having the same transitive closure as G contains at least as many arcs as the transitive reduction, G^t, of G. Moreover, if the types of the arcs are ignored then G^t ⊆ H for any H satisfying H^T = G^T.
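For intuition, the untyped case can be sketched directly. The version below ignores arc types, so it is the classical transitive reduction of a directed acyclic graph rather than the three-type extension formalized in Appendix A; the set-of-pairs representation is an assumption.

```python
from collections import defaultdict


def transitive_reduction(arcs):
    """Drop every arc (u, v) of a DAG that is implied by a longer path."""
    succ = defaultdict(set)
    for u, v in arcs:
        succ[u].add(v)

    def reaches(src, dst):
        stack, seen = [src], {src}
        while stack:
            x = stack.pop()
            if x == dst:
                return True
            for y in succ.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    # (u, v) is redundant if some other successor w of u still reaches v.
    return {(u, v) for u, v in arcs
            if not any(w != v and reaches(w, v) for w in succ[u])}


g = {(1, 2), (2, 3), (1, 3)}
g_t = transitive_reduction(g)
```

Here `g_t` keeps (1, 2) and (2, 3) and drops (1, 3). Any graph H with the same transitive closure, such as `g` itself, contains every arc of `g_t`, in line with the untyped remark above.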


The reduced join graph obtained from a join graph JG_Q is not unique unless every vertex equivalence class in JG_Q is a single member equivalence class. For example, consider any two reduced join graphs JG_1 and JG_2 obtained from a given join graph JG_Q. Clearly JG_1 and JG_2 have the same set of vertex equivalence classes. The corresponding vertex equivalence classes in JG_1 and JG_2 have the same set of vertices, but the spanning trees of vertices in the same equivalence class may not be the same. There is an arc (u_1,v_1)_k, 1 ≤ k ≤ 3, in JG_1 where u_1 ∈ E_i, v_1 ∈ E_j, i ≠ j, iff there is a corresponding arc (u_2,v_2)_k in JG_2 satisfying u_2 ∈ E_i, v_2 ∈ E_j. However, these arcs may differ in their end vertices, i.e., u_1 ≠ u_2 or v_1 ≠ v_2 is possible. The following lemma shows that any reduced join graph of a query Q represents a query which is equivalent to Q.


Lemma 3.2: If JG is a reduced join graph for Q then JG^T = JG_Q^T.

Proof: By construction, JG and JG_Q have the same set of vertex equivalence classes, and the condensed acyclic graph of JG is the transitive reduction of the condensed acyclic graph of JG_Q. Thus, from Lemma 3.1, the condensed acyclic graphs of the join graphs JG and JG_Q have the same transitive closure. Since it is straightforward to show that any two join graphs have the same transitive closure if they have the same vertex equivalence classes and the transitive closures of their condensed acyclic graphs are the same, it follows that JG^T = JG_Q^T. #


For any query Q, let AJ(Q) denote the set of all join graphs having the same transitive closure as that of the join graph of Q, i.e. AJ(Q) = { JG | JG^T = JG_Q^T }. Clearly, AJ(Q) represents all queries, including Q itself, which are equivalent to Q. By definition, if there is a join graph in AJ(Q) whose query graph is a tree then Q is a tree query, otherwise it is a cyclic query. Let RJ(Q) denote the set of all reduced join graphs for Q. From Lemma 3.2, RJ(Q) ⊆ AJ(Q). The following lemma shows that in order to determine the type of a query Q it suffices to consider only the join graphs in the subset RJ(Q). This justifies that a reduced join graph is a canonical representation of a query.


Lemma 3.3: Q ∈ TQ iff there is a join graph JG in RJ(Q) such that the query graph of JG is a tree.

Proof: The if-part follows immediately since RJ(Q) ⊆ AJ(Q). For the only-if-part, let JG_0 be a join graph in AJ(Q) whose query graph, QG_0, is a tree. Then a reduced join graph JG can be constructed from JG_0 such that the set of edges in the query graph, QG, of JG is a subset of that in QG_0. Both QG and QG_0 have the same set of vertices and since QG_0 is a tree, from Lemmas 3.2 and 2.2, QG is connected and is also a tree. #

In order to determine the type of a query Q, a reduced join graph obtained from JG_Q can be utilized with no loss of generality, as is shown by the preceding lemma. However, some of the vertices and arcs in a reduced join graph can be eliminated by local processing. Consider a spanning tree corresponding to a vertex equivalence class in a reduced join graph. Each edge in the spanning tree represents an equi-join clause. Equi-join clauses between the attributes of the same relation can be eliminated as a result of local processing. Thus, if there is a join attribute of a relation in a vertex equivalence class then the other attributes of the same relation are assumed to be absent from that vertex equivalence class.

For simplicity, we say that a connected component, C_k, contains R_i if there is at least one attribute of R_i in C_k. Similarly, an equivalence class contains R_i if an attribute of R_i is in that equivalence class. We also say that R_i is adjacent to (from) R_j if the arc (R_i.a, R_j.b)_t ((R_j.b, R_i.a)_t) is in C_k, where R_i.a and R_j.b are some attributes of R_i and R_j respectively, and 1 ≤ t ≤ 3. It is assumed that each connected component in JG contains at least two relations, because otherwise all the join clauses in the connected component could be performed by local processing, and hence that connected component could be eliminated from JG. The next section is concerned with determining the type of a query Q using a reduced join graph for Q.

3.3 A General Tree-query Membership Algorithm

The approach we use in determining the type of a given query is similar to the one described in [29]. There, the relations and the corresponding clauses in a query are successively examined and those whose removal does not influence the type of the query are eliminated. If all the relations and the join clauses can be eliminated in this manner (resulting in a null join graph and a null query graph) then the final query and, more importantly, the original query are tree queries. On the other hand, if some relations and join clauses remain then both the resulting and the original queries are cyclic queries.

The procedure here, in somewhat more detail, is as follows. Consider a reduced join graph, JG, of a query after all the local processing described in the preceding section is performed. Let E_1, ..., E_m denote the vertex equivalence classes in JG. Associated with each relation, R_i, two sets M_i and S_i are defined as the set of multimember equivalence classes and the set of single member equivalence classes containing R_i respectively. That is, M_i = {E_t | E_t contains R_i and |E_t| > 1} and S_i = {E_t | E_t contains R_i and |E_t| = 1}, 1 ≤ i ≤ n. Let R_i and R_j be any two relations involved in Q, i ≠ j, 1 ≤ i, j ≤ n. R_i is said to be superseded by R_j if

(1) M_i ⊆ M_j, and

(2) for each E_t ∈ S_i, every vertex E_k that is adjacent to or from E_t in the condensed acyclic graph of JG satisfies E_k ∈ S_i ∪ M_j ∪ S_j.

To illustrate the two conditions above, suppose R_i is superseded by R_j. Consider any joining attribute, say R_i.a, of R_i. R_i.a is either in a multimember equivalence class or in a single member equivalence class. In the first case, that multimember equivalence class also contains R_j (from condition 1). In the latter case, every vertex v that is adjacent to or from R_i.a is either in an equivalence class which also contains R_j, or v is another attribute of R_i and it is in a single vertex equivalence class (from condition 2). Clearly, the same conditions are satisfied by every attribute of R_i.
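The two conditions translate into a short predicate. The sketch below assumes the equivalence classes are given as a dictionary from a class identifier to a set of (relation, attribute) pairs, and the condensed acyclic graph as a set of (class, class) arcs with types ignored; both representations, the function name, and the example query are hypothetical.

```python
def is_superseded(i, j, classes, ac_arcs):
    """True if relation i is superseded by relation j (conditions 1 and 2)."""
    def contains(cid, rel):
        return any(r == rel for r, _ in classes[cid])

    def multi(rel):   # M_rel: multimember classes containing rel
        return {c for c in classes if len(classes[c]) > 1 and contains(c, rel)}

    def single(rel):  # S_rel: single member classes containing rel
        return {c for c in classes if len(classes[c]) == 1 and contains(c, rel)}

    if not multi(i) <= multi(j):                # condition (1)
        return False
    allowed = single(i) | multi(j) | single(j)
    for t in single(i):                          # condition (2)
        neighbours = ({b for a, b in ac_arcs if a == t} |
                      {a for a, b in ac_arcs if b == t})
        if not neighbours <= allowed:
            return False
    return True


# Hypothetical query: (R1.a = R2.a) .and. (R2.a = R3.a) .and. (R2.b > R3.b)
classes = {"E1": {("R1", "a"), ("R2", "a"), ("R3", "a")},
           "E2": {("R2", "b")},
           "E3": {("R3", "b")}}
ac_arcs = {("E2", "E3")}
```

With these data R1 is superseded by R2 and R2 is superseded by R3, but R2 is not superseded by R1, because E3 is adjacent from E2 and contains neither R1 nor a single attribute of R2.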

If R_i is superseded by R_j then it will be shown (Lemmas 3.5 and 3.6) that in order to determine the type of Q it suffices to consider only a subset of all reduced join graphs for Q, since Q ∈ TQ iff there is a join graph in the subset whose query graph is a tree. Furthermore, every join graph in the subset has the property that all the arcs incident with attributes of R_i in such a join graph correspond to a single edge, (R_i,R_j), in the query graph. That is, the query graph of any join graph in the subset has exactly one edge incident on R_i. Consequently, any cycle (if one exists) in such a query graph can not be incident with R_i, and hence removing R_i will not affect the type of Q (Prop. 3.1). R_i is removed from Q by eliminating all the attributes of R_i and the arcs incident with them from a join graph in the subset. In the remaining join graph there may be some connected components each containing a single relation. Those connected components can be removed as they do not contribute any edges to the query graph and therefore have no influence on the type of Q. However, the removal of R_i and the subsequent removals of single relation connected components (if they exist) may cause some relations to become superseded by some other relations. The process of elimination continues until either all relations and the corresponding arcs are removed (giving a null join graph) or some relations remain such that for every pair of relations R_i, R_j in the remaining set, R_i is not superseded by R_j and each connected component contains at least two relations. In the former case, Q is a tree query since a null query graph is a tree query graph and each removal preserves the type of the query. Furthermore, the conjunction of join clauses corresponding to the set of removed arcs is a query equivalent to Q and the query graph of this query is a tree (Prop. 3.1). In the latter case, Proposition 3.2 shows that the resulting query must be a cyclic query, which in turn implies that Q is also a cyclic query.


toe  now  proceea  to  snow  tne  above  statements.  Let  RJ(^) 
be  the  set  of  all  reducea  join  graphs  for  Q.  Suppose  R1  is 
superseded  by  ,  i^j ,  1  <  i,j  <  n.  A  mapping  A  is  defined 
wmcn  maps  connected  components  in  join  grapns  in  RJ  (q)  to 
connected  components  in  join  graphs  of  a  subset  of  RJ (g) , 
denoted  by  RJ(Q),  such  that  the  query  graph  of  any  join  graph 
in  RJ(U)  contains  the  edge  (R^,Rj)  out  does  not  contain  any 
edge  (R^,R^),  j .  M  leaves  all  arcs  which  are  not  incident 
with  R^  unchanged,  and  replaces  certain  arcs  incident  with 
R^  by  some  arcs  incident  witn  R^ .  Tne  mapping  M  has  the 
following  properties: 

(a)  If  fi,  is  a  vertex  equivalence  class,  so  is  M(£,  ), 

(o)  It  there  is  an  arc  between  two  vertex  equivalence 
classes,  M  either  leaves  the  arc  uncnangea  or 
replaces  it  by  another  arc  of  the  same  type  between 
the  same  two  equivalence  classes. 
More precisely, M is described as follows. Let JG be a join
graph in RJ(Q) and C be a connected component in JG. C
contains one or more vertex equivalence classes. We first
describe how M modifies the arcs within equivalence classes.
Let T_k denote a spanning tree corresponding to a vertex
equivalence class E_k in C, such that an "edge" (u,v) in T_k
denotes the two arcs (u,v) and (v,u). E_k is one of the
following types:

(1) it does not contain R_i,

(2) it contains R_i, R_j and the edge (R_i.a, R_j.b) is in T_k,

(3) it contains R_i, R_j but the edge (R_i.a, R_j.b) is not in
T_k,

(4) it contains R_i but not R_j,

where R_i.a and R_j.b denote the attributes of R_i and R_j in E_k,
respectively. M does not change the equivalence classes of
type (1). If E_k is of type (2), edges of the form (R_i.a, v),
v ∈ E_k, v ≠ R_j.b, are mapped to (R_j.b, v) while other edges of
T_k are unchanged. If E_k is of type (3), let the edges
incident with R_i.a in T_k be (R_i.a, v_1), ..., (R_i.a, v_s), s ≥ 1.
T_k is a tree, so there is a unique path in T_k from R_i.a to
R_j.b passing through exactly one vertex in {v_t | 1 ≤ t ≤ s}.
Without loss of generality let v_1 be that vertex. Then M
maps the edge (R_i.a, v_1) to (R_i.a, R_j.b) and the edges
(R_i.a, v_t), 2 ≤ t ≤ s, are mapped to (R_j.b, v_t) in T_k.
Equivalence classes of type (4) are not changed by M.
(Notice that such an equivalence class consists of only one
vertex, namely R_i.a, since R_i is superseded by R_j.)
The arcs incident with vertices in different vertex
equivalence classes in C are mapped as follows. Consider an
arc (u,v)_t, 1 ≤ t ≤ 3, in C such that u and v are in different
equivalence classes, u ∈ E_m, v ∈ E_k and m ≠ k. If neither u
nor v is an attribute of R_i then the arc (u,v)_t is not
changed by M. Otherwise, without loss of generality let u be
R_i.a. E_m cannot be a vertex equivalence class of type (1)
since it contains R_i. If E_m is of type (2) or (3) then the
arc (R_i.a, v)_t is mapped to the arc (R_j.b, v)_t, where R_j.b
is the attribute of R_j in E_m. Otherwise E_m is of type (4)
(i.e. a single vertex equivalence class containing only
R_i.a) and since R_i is superseded by R_j there are three
possible cases for v:

(i) v is an attribute of R_i in E_k and |E_k| = 1,

(ii) v is an attribute of R_j in E_k,

(iii) v is an attribute of neither R_i nor R_j, but E_k
contains R_j.

In cases (i) and (ii) the arc (R_i.a, v)_t is left unchanged
by M. For case (iii), let R_j.b denote the attribute of R_j in
E_k. M maps the arc (R_i.a, v)_t to the arc (R_i.a, R_j.b)_t.

Lemma 3.4: If C is a connected component in JG, JG ∈ RJ(Q),
then C and M(C) have the same vertex equivalence classes and
their condensed acyclic graphs are the same.

Proof: Follows from the construction of M which possesses
properties (a) and (b). #

Since JG is a reduced join graph, it follows from
Lemma 3.4 that M(JG) is also a reduced join graph for Q.
Consequently, M(RJ(Q)) ⊆ RJ(Q) (from Lemma 3.2), and both JG
and M(JG) represent queries which are equivalent to Q (from
Lemmas 3.2 and 2.1). The following two lemmas show that the
query graph of a join graph in RJ(Q) is a tree iff the query
graph of the corresponding join graph in M(RJ(Q)) is a tree.
Thus, in order to determine the type of a query Q it is
sufficient to consider only the join graphs in M(RJ(Q)). In
Lemmas 3.5 and 3.6, below, R_i and R_j are any two relations
such that R_i is superseded by R_j.

Lemma 3.5: If there is a join graph JG_1 in M(RJ(Q)) such that
the query graph of JG_1 is a tree then there is a join graph
in RJ(Q) whose query graph is also a tree.

Proof: Since M(RJ(Q)) ⊆ RJ(Q), JG_1 ∈ RJ(Q). #

Lemma 3.6: Let QG_1 be the query graph of a join graph
JG_1 ∈ RJ(Q). If QG_1 is a tree then the query graph of M(JG_1)
is also a tree.

Proof: Let JG_2 denote M(JG_1) and QG_2 denote the query graph
of JG_2. We now show that QG_2 is a tree by demonstrating that
QG_2 is connected and has the same number of edges as QG_1.

From Lemma 3.4, JG_1^T = JG_2^T. Since QG_1 is connected, QG_2
is also connected, by Lemma 2.2. QG_1 may or may not have the
edge (R_i, R_j).

Case 1: QG_1 contains the edge (R_i, R_j). Since M leaves
all edges not incident with R_i unchanged, any edge which is
in QG_1, but not in QG_2, must be incident on R_i. Let
(R_i, R_t), t ≠ j, t ≠ i, be such an edge. Then there exists at
least one arc in JG_1 that is incident with some attribute of
R_i, R_i.a, and some attribute of R_t, R_t.b.

If R_i.a and R_t.b are in the same equivalence class, say
E_m, then E_m must also contain an attribute of R_j, say R_j.c,
since R_i is superseded by R_j. Moreover, the spanning tree T_m
(representing E_m) must contain the edge (R_i.a, R_j.c) since
otherwise there would be an alternate path in QG_1 from R_i to
R_j via some other vertex, contradicting that QG_1 is a tree.
Thus, E_m is an equivalence class of type (2), so the edge
(R_i.a, R_t.b) is mapped to (R_j.c, R_t.b) by M. Thus, QG_2
contains the edge (R_j, R_t).

Now suppose R_i.a and R_t.b are in different vertex
equivalence classes, R_i.a ∈ E_m, R_t.b ∈ E_k, m ≠ k, and let the
arc in between be (R_i.a, R_t.b)_s, 1 ≤ s ≤ 3. If E_m is not a
single member equivalence class then an attribute of R_j, say
R_j.d, must exist in E_m since R_i is superseded by R_j. M maps
the arc (R_i.a, R_t.b)_s to (R_j.d, R_t.b)_s. If E_m is a single
member equivalence class then E_k contains an attribute of
R_j, say R_j.e. (Note that E_k cannot be a single member
equivalence class containing only R_j since the edge under
consideration is (R_i.a, R_t.b)_s, t ≠ i, t ≠ j.) Since QG_1 is a
tree, E_k must also contain another attribute of R_i, say
R_i.f, and T_k must contain the unique path (R_j.e, R_i.f),
(R_i.f, R_t.b) from R_j.e to R_t.b. M maps the arc
(R_i.a, R_t.b)_s to (R_i.a, R_j.e)_s and it maps the edge
(R_i.f, R_t.b) in T_k to (R_j.e, R_t.b). Thus, in either
situation QG_2 contains the edge (R_j, R_t). QG_1 cannot
contain (R_j, R_t) since it is a tree. Therefore QG_2, in
comparison with QG_1, loses one edge (R_i, R_t) but gains
(R_j, R_t). Since the same argument is applicable for every
edge (R_i, R_t), t ≠ i, t ≠ j, QG_2 has the same number of edges
as QG_1.

Case 2: QG_1 does not contain the edge (R_i, R_j). Since
QG_1 is a tree there is a unique path in QG_1 from R_i to R_j.
Let this path be e = (R_i, R_k1, ..., R_kp, R_j), p ≥ 1. Consider
any connected component C in JG_1 containing R_i. C also
contains R_j, and hence there is a path in QG_1 from R_i to R_j
which is mapped by the arcs in C. This path must be the same
as e since QG_1 is a tree. The remaining arguments are similar
to those of case 1. #

Proposition 3.1: Let Q be a query such that R_i is superseded
by R_j, for some i ≠ j, 1 ≤ i,j ≤ n. Q ∈ TQ iff there exists
a tree query graph corresponding to a join graph in M(RJ(Q)).
Furthermore any cycle in the query graph corresponding to a
join graph in M(RJ(Q)) cannot be incident on R_i.

Proof: Follows from Lemmas 3.5, 3.6 and the fact that R_i has
only one edge incident on it in the query graph
corresponding to any join graph in M(RJ(Q)). #


Thus no existing cycle in the query graph can be
incident on R_i. As a consequence, the removal of R_i does not
affect the query type. Observe that "superseding" is
independent of which reduced join graph is considered since
all the reduced join graphs have the same vertex equivalence
classes and their condensed acyclic graphs are the same.

We now proceed with the removal of R_i from a join graph
JG, JG ∈ M(RJ(Q)), and show that the join graph remaining after
the removal of R_i is in canonical form. The relation R_i is
removed by eliminating all the attributes of R_i and the arcs
incident on them from JG. The removal of R_i may cause some
connected components in JG to consist of attributes of a
single relation. Such connected components are also
eliminated since those connected components do not
contribute any edges to the query graph, and therefore have
no influence on the type of Q. In essence, a mapping ER_i is
defined where ER_i(JG) is the join graph resulting after the
elimination of R_i from JG, where JG ∈ M(RJ(Q)). (Note that if
JG contains R_i and R_j but no other relation then ER_i(JG) is
a null join graph.) From proposition 3.1, it is clear that for
any two join graphs JG_1 and JG_2 in M(RJ(Q)), the queries
represented by ER_i(JG_1) and ER_i(JG_2) are equivalent.
Moreover, the following lemma shows that ER_i(JG) is a
reduced join graph for any JG ∈ M(RJ(Q)).
Lemma 3.7: Let JG' be the join graph resulting after the
removal of a superseded relation R_i from JG, i.e.
JG' = ER_i(JG), where JG ∈ M(RJ(Q)). Then JG' is a reduced join
graph.

Proof: Suppose JG' is not null since otherwise the result is
trivially true. By construction, for each equivalence class
E_k' in JG', JG contains a corresponding equivalence class
E_k such that E_k' ⊆ E_k. From proposition 3.1, each
equivalence class E_k' in JG' is represented by a spanning
tree of the vertices in E_k'. Moreover, JG' contains an arc
(u,v)_s, 1 ≤ s ≤ 3, where u ∈ E_k', v ∈ E_m', k ≠ m, only if JG
contains the same arc, (u,v)_s, satisfying u ∈ E_k, v ∈ E_m. Since
JG is a reduced join graph it does not contain any redundant
arc incident with vertices in different equivalence classes.
This shows that JG' also does not contain any redundant arc.
Therefore JG' is also a reduced join graph. #

Thus the removal of the superseded relation R_i from a
join graph in M(RJ(Q)) results in a reduced join graph. Let
RJ(Q') denote the set of all such join graphs, i.e.,
RJ(Q') = {JG' | JG' = ER_i(JG) and JG ∈ M(RJ(Q))}, and Q' be the
query represented by a join graph in RJ(Q'). It is evident
that RJ(Q') is the set of all reduced join graphs for Q'.
Consequently, the type of the original query Q can be
determined from a join graph resulting after the removal of
R_i from any join graph in M(RJ(Q)).

Finally, proposition 3.2 shows that if the elimination
process leaves some relations then the query must be a
cyclic query.


Proposition 3.2: Let Q be a query involving relations
{R_1,...,R_n} such that R_i is not superseded by R_j for any
i ≠ j, 1 ≤ i,j ≤ n, and each connected component contains at
least two relations. Then Q ∈ CQ.

Proof: Let QG be the query graph of any join graph, say JG,
in RJ(Q). From Lemma 3.3, it suffices to show that QG is
cyclic. Consider R_i in QG. An attribute of R_i is adjacent to
or from some vertex which is an attribute of another
relation, say R_j, in a connected component in JG. Thus, R_i
and R_j are also adjacent in QG. Since R_i is not superseded
by R_j, at least one of the two conditions for R_i to be
superseded by R_j does not hold.

Case 1: M_i ⊄ M_j. Then there exists a multimember
equivalence class E_k such that E_k ∈ M_i and E_k ∉ M_j. Let R_i.a
be the attribute of R_i in E_k. Since |E_k| ≥ 2, there is at
least one other vertex, R_t.b, t ≠ i, t ≠ j, in E_k such that R_t.b
is adjacent to R_i.a in the spanning tree representing E_k.
Thus R_t, t ≠ i, t ≠ j, is also adjacent to R_i in QG, showing that
the degree of R_i in QG is at least 2.

Case 2: In the condensed acyclic graph of JG there is an
arc incident with E_k and E_m, where E_k ∈ S_i and
E_m ∉ S_i ∪ M_j ∪ S_j. Let R_i.a be the attribute of R_i in E_k.
Then there exists at least one vertex, R_t.b, t ≠ i, t ≠ j, in
E_m such that JG contains an arc incident with R_i.a and R_t.b.
Thus, R_i and R_t are also adjacent in QG. This shows that the
degree of R_i is at least 2.

Since the above argument is applicable to every vertex in
QG, each vertex in QG has degree ≥ 2, and therefore [12] QG
is cyclic. #
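
The last step of the proof uses the standard fact that a finite graph in which every vertex has degree at least 2 contains a cycle, whereas a tree always has a vertex of degree 1. The following is a minimal Python sketch of that check; the edge-list representation, the function names and the two example query graphs are assumptions made for illustration and do not come from the thesis.

```python
from collections import defaultdict

def min_degree(edges):
    """Return the smallest vertex degree in an undirected simple graph."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return min(degree.values())

def has_cycle(edges):
    """Detect a cycle with union-find: an edge whose endpoints are
    already connected closes a cycle."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return True
        parent[ru] = rv
    return False

# Hypothetical query graph in which every relation has degree >= 2.
cyclic_qg = [("R1", "R2"), ("R2", "R3"), ("R3", "R1")]
# Hypothetical tree query graph: R1 and R3 are leaves (degree 1).
tree_qg = [("R1", "R2"), ("R2", "R3")]
```

In the sketch, the cyclic example has minimum degree 2 and a cycle is found, while the tree example has a leaf and no cycle.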


Below we present an algorithm which determines the type
of a given query Q by successively examining each pair of
relations R_i, R_j in Q and removing R_i if R_i is superseded by
R_j, i ≠ j, 1 ≤ i,j ≤ n. For any tree query, Q, the algorithm
produces an equivalent query whose query graph is a tree,
and outputs the equivalent query together with its tree
query graph. The input query Q is free of join clauses that
can readily be eliminated by local processing, i.e.,
initially, each connected component in the join graph of Q
contains two or more relations and there is at most one
attribute of a relation within an equivalence class.


Algorithm 3.1:

Input: Query Q

Output: If Q ∈ TQ then "Q ∈ TQ", an equivalent query, Q*,
        and tree query graph, T, of Q*; else "Q ∈ CQ".

1) Construct JG_Q from Q. Initialize T to be an empty set
   and Q* to be a null query.

2) Construct the vertex equivalence classes E_1,...,E_p,
   and the transitive reduction, G^t, of the condensed
   acyclic graph of JG_Q. For each relation R_i in Q,
   construct the sets M_i and S_i where
      M_i = {E_t | E_t contains R_i and |E_t| > 1},
      S_i = {E_t | E_t contains R_i and |E_t| = 1}, 1 ≤ i ≤ n.

3) While (R_i is superseded by R_j) for some i ≠ j,
   1 ≤ i,j ≤ n do
   /* (R_i is superseded by R_j) = True if M_i ⊆ M_j, and for
      every E_t ∈ S_i, each E_k that is adjacent to or from E_t
      in G^t satisfies E_k ∈ S_i ∪ M_j ∪ S_j, where M_i ∪ S_i ≠ ∅.
   */
   Begin /* remove R_i */
   (i)   For every E_t ∈ S_i ∪ M_i remove the attribute of R_i
         from E_t and if E_t ∈ S_i also remove E_t and all
         arcs incident with E_t from G^t. Set M_i = S_i = ∅.
         For each such removed attribute of R_i, say
         R_i.k, set Q* <-- Q* .and. q where q is the
         join clause corresponding to the arc incident
         with R_i.k in a join graph in RJ(Q).
   (ii)  Update M_j and S_j due to (i).
         /* R_j is the only relation that may be affected by
            (i) since R_i is superseded by R_j. */
   (iii) If M_j = ∅ and every E_k ∈ S_j is such that there is
         no arc in G^t incident with E_k and E_m for
         E_m ∉ S_j, then remove every E_k ∈ S_j and the arcs
         incident with them (if they exist) from G^t. Set
         S_j = ∅. For each such removed arc a
         set Q* <-- Q* .and. (the join clause corresponding
         to a).
         /* Remove R_j, if each connected component containing
            R_j in the remaining join graph contains no
            other relation. */
   (iv)  Set T <-- T ∪ {(R_i, R_j)}.
   end

4) If all the relations are eliminated from Q then
   return ("Q ∈ TQ", Q* and T) else return ("Q ∈ CQ").


Clearly the algorithm terminates either when all the
relations are eliminated from Q or when there is no R_i, R_j,
i ≠ j, among the relations remaining in Q, such that R_i is
superseded by R_j. The correctness of the algorithm follows
directly from propositions 3.1 and 3.2. Note that if the
query graph of a query Q is a tree then the sequences of
semi-joins to answer Q are determined directly by the tree
query graph [3]. Thus, for any tree query Q the equivalent
query Q* produced by the algorithm can be utilized to answer
Q by semi-joins since Q* has a tree query graph (also
produced by the algorithm) and the equivalent queries have
the same answer.
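
The supersedence test in the comment of step 3 can be sketched as a small predicate. This is a minimal Python illustration under stated assumptions, not the implementation described in the next section: equivalence classes are identified by hypothetical string ids, `M` and `S` are dictionaries from a relation to its sets M_i and S_i, and `adjacent` maps a class id to the classes adjacent to or from it in the transitive reduction G^t.

```python
def is_superseded(i, j, M, S, adjacent):
    """Return True if relation i is superseded by relation j.

    M[r], S[r]: sets of multimember / single-member class ids containing r.
    adjacent[c]: class ids adjacent to or from class c in G^t (assumed symmetric).
    """
    if i == j or not (M[i] | S[i]):        # requires M_i U S_i to be non-empty
        return False
    if not M[i] <= M[j]:                   # first condition: M_i is a subset of M_j
        return False
    allowed = S[i] | M[j] | S[j]
    # second condition: every neighbour of a class in S_i lies in S_i U M_j U S_j
    return all(adjacent.get(e, set()) <= allowed for e in S[i])

# Hypothetical chain query: R1.a = R2.a and R2.b = R3.b.
chain_M = {1: {"E1"}, 2: {"E1", "E2"}, 3: {"E2"}}
chain_S = {1: set(), 2: set(), 3: set()}

# Hypothetical triangle query: R1.a = R2.a, R2.b = R3.b, R3.c = R1.c.
tri_M = {1: {"E1", "E3"}, 2: {"E1", "E2"}, 3: {"E2", "E3"}}
tri_S = {1: set(), 2: set(), 3: set()}

# Hypothetical inequality query: R1.x > R2.y, two single-member classes.
ineq_M = {1: set(), 2: set()}
ineq_S = {1: {"E1"}, 2: {"E2"}}
ineq_adj = {"E1": {"E2"}, "E2": {"E1"}}
```

In the chain example R1 and R3 are each superseded by R2, so the elimination can proceed; in the triangle example no relation is superseded by another, which is the situation covered by proposition 3.2.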
3.4 An Implementation of the Algorithm

In step 1, an adjacency matrix is constructed for each
connected component in JG_Q. Let C_1,...,C_m, m ≥ 1, be the
connected components in JG_Q and n_i denote the number of
vertices in C_i, 1 ≤ i ≤ m. The vertices in the join graph
can be numbered so that the vertices in the same connected
component have successive indices. Let R_s.a and R_t.b be the
i-th and j-th vertices respectively in a connected component
C_k where 1 ≤ i,j ≤ n_k, 1 ≤ k ≤ m. The adjacency matrix A_k of a
connected component C_k is an (n_k × n_k) matrix where rows and
columns represent the vertices in C_k and

$$
A_k(i,j) =
\begin{cases}
1 & \text{if } (R_s.a, R_t.b)_1 \text{ is in } C_k, \\
2 & \text{if } (R_s.a, R_t.b)_2 \text{ is in } C_k, \\
3 & \text{if } (R_s.a, R_t.b)_3 \text{ or } (R_t.b, R_s.a)_3 \text{ is in } C_k, \\
0 & \text{otherwise}
\end{cases}
$$

for i ≠ j and A_k(i,i) = 0, 1 ≤ i,j ≤ n_k, 1 ≤ k ≤ m. Note that
since (R_s.a, R_t.b)_3 is the same as (R_t.b, R_s.a)_3,
A_k(i,j) = A_k(j,i) = 3 if C_k contains either of the arcs
(R_s.a, R_t.b)_3 or (R_t.b, R_s.a)_3. It is clear that
constructing adjacency matrices for all the connected
components of JG_Q takes at most O(Σ_{i=1..m} n_i^2) time.
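
The construction of one adjacency matrix A_k can be sketched as follows. This is a minimal Python illustration; the list-of-lists matrix, the `(u, v, t)` arc triples and the example component are assumptions made for the sketch, with arc types 1, 2 and 3 as in the definition of A_k.

```python
def adjacency_matrix(vertices, arcs):
    """Build the adjacency matrix of one connected component.

    vertices: ordered list of attribute names, e.g. "R1.a".
    arcs: iterable of (u, v, t) triples, t in {1, 2, 3}; a type-3 arc
          is undirected, so it is recorded in both directions.
    """
    index = {v: k for k, v in enumerate(vertices)}
    n = len(vertices)
    A = [[0] * n for _ in range(n)]      # diagonal stays 0
    for u, v, t in arcs:
        i, j = index[u], index[v]
        A[i][j] = t
        if t == 3:
            A[j][i] = 3
    return A

# Hypothetical component with a type-1 arc and a type-3 arc.
example_vertices = ["R1.a", "R2.b", "R3.c"]
example_arcs = [("R1.a", "R2.b", 1), ("R2.b", "R3.c", 3)]
example_A = adjacency_matrix(example_vertices, example_arcs)
```

Allocating and filling an n_k × n_k matrix takes time proportional to n_k^2 plus the number of arcs, which is consistent with the bound stated for step 1.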


In step 2, the vertex equivalence classes and the
condensed acyclic graph, G, of JG_Q can be constructed in
time proportional to Σ_{i=1..m} n_i^2 [15]. The transitive
reduction, G^t, of the condensed acyclic graph G can be
constructed in at most O(Σ_{i=1..m} n_i^3) time (see
Appendix A). Then the construction of sets M_i and S_i, for
i=1,...,n, can be performed in time proportional to
Σ_{i=1..m} n_i. Thus the complexity of step 2 is
O(Σ_{i=1..m} n_i^3). Since step 4 takes at most O(n)
time, for the overall time complexity of the algorithm it
remains to establish the time required by step 3.


In step 3, in order to determine whether M_i ⊆ M_j the
data structures and the procedure given in [29] can be
employed as follows. (Note that a multimember equivalence
class in this context corresponds to a connected component
in [29].) For notational convenience, let E_1,...,E_t, 1 ≤ t ≤ p,
be the multimember equivalence classes, E_{t+1},...,E_p the
single member vertex equivalence classes in JG_Q, and
m_i = |E_i|, 1 ≤ i ≤ p. As discussed previously (section 2), each
multimember equivalence class contains at most one attribute
of a given relation after local processing. Thus, the
equivalence classes in JG_Q can be represented by an (n × p)
binary matrix with a 1 in row i and column j if relation R_i
is contained in equivalence class E_j. The nonzero entries of
this matrix are represented by two sets of intersecting
linked lists {LR_i | 1 ≤ i ≤ n} and {LE_j | 1 ≤ j ≤ p}. The singly
linked list LR_i represents the set M_i ∪ S_i, while the doubly
linked circular list LE_j represents the equivalence class E_j
(see Figure 3.2). A vector of counts, CE, is used where
CE(i) is the number of relations in E_i, i.e., CE(i) = |E_i|.

The check for M_i ⊆ M_j, 1 ≤ i,j ≤ n, M_i ≠ ∅, is facilitated
by an (n × n) matrix, Count. Initially, Count(i,j) = |M_i|,
1 ≤ i,j ≤ n, and as the executions (to be described) proceed,
Count(i,j) = |M_i − M_j|, i ≠ j. Clearly, Count(i,j) = 0 iff M_i ⊆ M_j.
To find the set of M's contained in a given M_j, the linked
list LR_j is traversed. This list intersects with the lists
of equivalence classes, {LE_s | E_s contains R_j}, containing
R_j. Each occurrence of any relation R_u, u ≠ j, in the list LE_s
indicates that the equivalence class E_s also contains R_u.
Thus, Count(u,j) should be decremented by the number of
occurrences of R_u in such lists {LE_s}. This is accomplished
by traversing the circular linked lists {LE_s} while
decrementing Count(u,j) for each occurrence of R_u, u ≠ j, in
those lists. Each node in LR_j denotes an occurrence of R_j in
an equivalence class, E_s. While traversing the list LE_s,
each node in the list is visited once and the traversal of
LE_s resumes when the node in LR_j is reached (note that if
|E_s| = 1 then LE_s consists of one node, that is the node in
LR_j). Thus the number of nodes visited in LE_s is the number
of vertices in E_s. Since this process is carried out for
each node in LR_j and for each list LR_j, the total time
required for determining whether or not M_i ⊆ M_j, for all
i ≠ j, 1 ≤ i,j ≤ n, is O(Σ_{k=1..p} m_k^2), where
m_k = |E_k|, 1 ≤ k ≤ p.
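
The bookkeeping behind the Count matrix can be sketched with plain Python sets standing in for the intersecting linked lists. This is an illustrative sketch only: relations are numbered from 0, each equivalence class is given as the set of relations that have an attribute in it (an assumption that relies on each class holding at most one attribute per relation), and the function name is hypothetical.

```python
def count_matrix(classes, n):
    """Return Count with Count[i][j] == |M_i - M_j| for i != j.

    classes: list of sets of relation indices, one set per equivalence class.
    n: number of relations.
    """
    multi = [c for c in classes if len(c) > 1]          # multimember classes only
    size_M = [sum(1 for c in multi if r in c) for r in range(n)]
    count = [[size_M[i]] * n for i in range(n)]         # Count(i, j) = |M_i| initially
    for j in range(n):                                  # "traverse LR_j"
        for c in multi:
            if j in c:                                  # class c belongs to M_j
                for u in c:                             # "traverse LE_s"
                    if u != j:
                        count[u][j] -= 1                # c is in both M_u and M_j
    return count

# Hypothetical example: E0 = {R0, R1}, E1 = {R1, R2}, E2 = {R3} (single member).
example_classes = [{0, 1}, {1, 2}, {3}]
example_count = count_matrix(example_classes, 4)
```

Each multimember class of size m_k is scanned once for each of its m_k members, which matches the O(Σ m_k^2) bound given above. In the example, Count[0][1] and Count[2][1] are 0, reflecting M_0 ⊆ M_1 and M_2 ⊆ M_1.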


Figure 3.2. The Linked List Structure Used by the Algorithm:
(a) A Join Graph JG; (b) The Matrix Representation.

In order to check for the second condition of
supersedence in step 3, an (n × n) matrix, Scount, is
utilized. All entries of Scount are initially 0. For a given
relation R_i, the set of relations R_j, i ≠ j, 1 ≤ j ≤ n, such that
the second condition for R_i to be superseded by R_j is
satisfied, is determined as follows. For each E_k in S_i,
every arc incident with E_k in G^t is traversed. Let E_u be a
vertex adjacent to or from E_k in G^t. If E_u ∉ S_i then
Scount(i,j) for all j satisfying E_u ∈ M_j ∪ S_j, j ≠ i, and
Scount(i,i) are incremented by 1. (This is performed by
traversing the linked list LE_u while incrementing
Scount(i,i) by 1 and incrementing Scount(i,j) by 1 for each
occurrence of R_j, i ≠ j, in the list.) After repeating this
process for every vertex in G^t that is adjacent to or from
E_k, and for every E_k in S_i, Scount(i,i) = Scount(i,j)
for i ≠ j indicates that the second condition for R_i to be
superseded by R_j is satisfied (i.e., R_i is superseded by R_j
if M_i ⊆ M_j and Scount(i,i) = Scount(i,j)). Clearly, the number
of operations required for each E_k, E_k ∈ S_i, is proportional
to the number of arcs in JG_Q^T incident with the attribute
of R_i which is in E_k. Thus, all the relations, R_j, such that
the second condition for R_i to be superseded by R_j is
satisfied can be determined in time proportional to the
number of arcs in JG_Q^T incident with those attributes of R_i
which are in single vertex equivalence classes. This process
is carried out for every relation R_i, 1 ≤ i ≤ n. Hence, the
total time required to check for the second condition of
supersedence for every pair of relations R_i, R_j, i ≠ j,
1 ≤ i,j ≤ n, is at most O(e') where e' is the total number
of arcs in JG_Q^T incident with vertices that are in
different equivalence classes.
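
The Scount bookkeeping can be sketched in the same style. This is a minimal Python illustration under assumptions: relations are numbered from 0, `classes[c]` is the set of relations with an attribute in class c, and `adjacent[c]` lists the classes adjacent to or from c in G^t; the names and the example are hypothetical.

```python
def scount_matrix(classes, adjacent, n):
    """Return Scount; the second supersedence condition for (i, j)
    holds exactly when Scount[i][i] == Scount[i][j]."""
    scount = [[0] * n for _ in range(n)]
    for i in range(n):
        # S_i: single-member classes whose only relation is i
        S_i = {c for c, rels in enumerate(classes) if rels == {i}}
        for k in S_i:
            for u in adjacent.get(k, ()):
                if u in S_i:
                    continue                 # neighbours inside S_i are not counted
                scount[i][i] += 1
                for j in classes[u]:         # every R_j with E_u in M_j or S_j
                    if j != i:
                        scount[i][j] += 1
    return scount

# Hypothetical example: E0 = {R0}, E1 = {R1}, E2 = {R1, R2},
# with one arc between E0 and E1 (e.g. R0.x > R1.y).
example_sc_classes = [{0}, {1}, {1, 2}]
example_sc_adjacent = {0: {1}, 1: {0}}
example_scount = scount_matrix(example_sc_classes, example_sc_adjacent, 3)
```

In the example Scount[0][0] equals Scount[0][1] but not Scount[0][2], so the second condition holds for R0 with respect to R1 and fails with respect to R2.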


If a relation, R_i, is superseded by another relation,
R_j, then R_i is removed (step 3-i). The removal of R_i is
performed by traversing the linked list LR_i once, and
deleting every node in LR_i as it is encountered. Note that
each node in LR_i is also a node in an intersecting list LE_k,
where E_k is either in S_i or in M_i. The operations
accompanied by the removal of R_i are described below in two
cases.

Case 1: E_k ∈ S_i. Then the node in LR_i is the only node in
the intersecting list LE_k. Thus, by the removal of the node
in LR_i, the linked list LE_k becomes empty, and is deleted
from the list structure. The vertex E_k and every arc
incident with E_k are deleted from G^t. For each such deleted
arc, (E_k, E_u)_t or (E_u, E_k)_t, 1 ≤ t ≤ 3, a corresponding join
clause is stored in Q* and the matrix Scount is updated as
follows. Let R_i.a be the attribute of R_i in E_k. Since R_i is
superseded by R_j, E_u is either in S_i or in M_j ∪ S_j. Let v
denote the attribute of R_i in E_u if E_u ∈ S_i, and the attribute
of R_j in E_u otherwise. The vertex v is determined by
traversing the nodes in the linked list LE_u. Then the join
clause corresponding to the arc (E_k, E_u)_t ((E_u, E_k)_t) is
(R_i.a > v) ((R_i.a < v)) if t=1; (R_i.a ≥ v) ((R_i.a ≤ v)) if
t=2; and (R_i.a ≠ v) ((R_i.a ≠ v)) otherwise. Observe that if
E_u ∈ S_j then (since E_k ∈ S_i) the arc, (E_k, E_u)_t or (E_u, E_k)_t,
1 ≤ t ≤ 3, has been counted previously while determining the set
of relations R_m, m ≠ j, such that the second condition for R_j
to be superseded by R_m is satisfied. Since the arc incident
with E_k and E_u in G^t was just deleted due to the removal of
R_i, if E_u ∈ S_j then Scount(j,j) and Scount(j,i) should be
decremented by 1. Thus, for every such arc, (E_k, E_u)_t or
(E_u, E_k)_t, deleted from G^t due to the removal of R_i, where
E_k ∈ S_i, 1 ≤ t ≤ 3, Scount(j,j) and Scount(j,i) are decremented by
1 if E_u ∈ S_j. The number of counts that has to be decremented
in the matrix Scount for E_k is at most twice as many as the
number of arcs in G^t incident with E_k, E_k ∈ S_i. For each arc
in G^t incident with E_k, exactly one join clause is stored in
Q* and the time required to determine the join clause
associated with such an arc (E_k, E_u)_t or (E_u, E_k)_t is at
most proportional to the number of vertices in E_u. Thus, if
E_k ∈ S_i then the total time required by operations accompanied
by the removal of the node from LR_i is proportional to the
number of arcs in JG_Q^T incident with the attribute of R_i in
E_k.

Case 2: E_k ∈ M_i. Then the node in LR_i is deleted from the
intersecting (doubly linked) list LE_k. The count associated
with LE_k, CE(k), is decremented by 1. Let R_i.a be the
attribute of R_i in E_k. Since R_i is superseded by R_j,
M_i ⊆ M_j, and hence E_k also contains an attribute of R_j. Let
R_j.b be the attribute of R_j in E_k, which can be identified
by a traversal of the list LE_k. Then, an equi-join clause
(R_i.a = R_j.b) is stored in Q*. Thus, if E_k ∈ M_i then the
removal of the node from LR_i is performed in time
proportional to the number of vertices in E_k.

Consequently, the time required for the removal of any
node from LR_i is proportional to the number of arcs in JG_Q^T
incident with the attribute of R_i corresponding to the
removed node. Since every node in LR_i is removed in this
manner, the total time required by the removal of R_i is
proportional to the number of arcs in JG_Q^T incident with
the attributes of R_i. All the relations are removed in the
worst case, which takes O(e) time where e is the number of
arcs in JG_Q^T.

The removal of a relation, R_i, may cause some
multimember vertex equivalence classes to become
single member equivalence classes. This may in turn cause
some relations to become superseded by some other relations.
Thus, the matrices Count and Scount have to be updated. Let
S = {E_s | 1 ≤ s ≤ p} be the set of equivalence classes which become
single member equivalence classes after the removal of R_i.
The set S can be formed during the removal of R_i. Let R_j be
the relation that superseded R_i, i ≠ j. The relation contained
in any equivalence class, E_s, in S is R_j because E_s
previously contained an attribute of R_i and an attribute of
R_j, and the attribute of R_i was just removed. Thus, after
the removal of R_i, each equivalence class in S becomes a
member of S_j while each such equivalence class was a member
of M_j before the removal.
Since M_j becomes M_j − S, the matrix Count has to be
updated accordingly (step 3-ii). That is, if there are f
equivalence classes in S, f > 0, then all the counts
associated with M_j, i.e., Count(j,k), 1 ≤ k ≤ n, are decremented
by f. The number of counts that have to be decremented is at
most n−1. Since the number of relations to be removed until
the algorithm terminates is at most n, the updating of the
matrix Count takes at most O(n^2) time.


Similarly, since S_j becomes S_j ∪ S after the removal of
R_i, all the counts associated with S_j, i.e., Scount(j,k),
1 ≤ k ≤ n, have to be updated for the equivalence classes in S
(step 3-ii). This proceeds by traversing all the arcs in G^t
that are incident with E_s, for every E_s ∈ S. Let E_u be a
vertex that is adjacent to or from E_s in G^t, E_s ∈ S. If
E_u ∉ S_j then Scount(j,j) and Scount(j,k) for all k
satisfying E_u ∈ M_k ∪ S_k, k ≠ j, are incremented by 1. This is
repeated for every vertex, E_u, that is adjacent to or from
E_s in G^t and for every E_s in S. Thus the updating of the
matrix Scount requires twice as many operations as
traversing those arcs in JG_Q^T which are incident with an
attribute of R_j in E_s and some other vertex in a different
equivalence class, exactly once for every E_s ∈ S. However,
each equivalence class in S was in M_j before the removal of
R_i, and became a member of S_j after the removal of R_i. In
the worst case, until the termination of the algorithm each
equivalence class may become a single member equivalence
class only once. Thus the total time required to update the
matrix Scount is at most O(e'), where e' is the total number
of arcs in JG_Q^T incident with vertices that are in
different vertex equivalence classes.


As described above, if R_i is superseded by R_j, i ≠ j, and
is removed then the only relation that is affected by the
removal is R_j. After the removal of R_i, R_j may become an
"isolated" relation (i.e., every connected component
containing R_j in the remaining join graph may contain no
other relation), or R_j may become superseded by some other
relations. These cases are determined as follows. After
updating the matrices Count and Scount due to the removal of
R_i, Count(j,j) = 0 and Scount(j,j) = 0 indicates that M_j = ∅
and, either S_j = ∅ or for every E_k ∈ S_j, every vertex adjacent
to or from E_k in G^t is also in S_j, i.e., R_j is an isolated
relation (step 3-iii). On the other hand, Count(j,k) = 0 and
Scount(j,j) = Scount(j,k), k ≠ j, indicates that R_j has become
superseded by R_k. In either case, R_j has to be removed as
described previously. (Observe that if R_j is removed as
being an isolated relation (step 3-iii) then all the join
clauses stored in Q* associated with the removal of R_j
involve attributes of only R_j and no other relation.) The
time required for the removal of R_j has already been taken
into consideration. Associated with every relation R_i, which
is removed as being superseded by some other relation R_j,
i ≠ j, an edge (R_i, R_j) is stored in T (step 3-iv). Since at
most n−1 such edges are stored in T, the overall complexity
of step 3 is at most O(max(n^2, e)) where e is the number of
arcs in JG_Q^T.

Consequently,  the  worst  case  time  complexity  of  the 

P  m  _  m  _ 

algorithm  is  0(max(nz,  Z  n.))  since  e  <  Z  rn  ,  where  n 

i  =  l  1  ~  i=l  1 

is  the  number  of  relations  and  n ^  is  the  number  of  vertices 
in  connected  component  C^,  l<i<m. 

For each vertex in a join graph of Q, there is a node in the linked
lists $\{LR_i \mid 1 \le i \le n\} \cup \{LE_j \mid 1 \le j \le p\}$.
Thus the storage requirement for the two sets of lists is
proportional to the number of vertices in a join graph of Q, i.e.,
$O(\sum_{i=1}^{m} n_i)$. Clearly, the storage requirement for the
adjacency matrices of all the connected components is
$O(\sum_{i=1}^{m} n_i^2)$. Since the matrices Count and Scount take
$O(n^2)$ storage, the overall storage complexity of the algorithm is
$O(\max(n^2, \sum_{i=1}^{m} n_i^2))$.

3.5 Summary

A canonical representation of a distributed relational query, called
a reduced join graph, is introduced. The query represented by a
reduced join graph does not contain any redundant join clauses, and
is equivalent to the original query. A conceptually simple algorithm
for determining the type of such a query has been presented. For tree
queries, the presented algorithm produces an equivalent query whose
query graph is a tree, and outputs the equivalent query together with
its tree query graph. Tree queries can always be answered by
semi-joins, which usually can be computed with much less data
transmission than joins. Moreover, a sequence of semi-joins that can
be used to answer a tree query simply corresponds to a traversal of a
tree query graph of an equivalent query [3]. Thus, the presented
algorithm can effectively be utilized to improve the performance of
distributed query processing mechanisms.


CHAPTER 4


DISTRIBUTED QUERY OPTIMIZATION FOR TREE QUERIES

4.1 Introduction

This chapter presents an approach to find the optimum sequence of
semi-joins for answering a class of tree queries. Queries have
conjunctive equi-join qualifications, and the number of common
joining attributes between any two relations is at most one.
Considering only such queries has some simplifying consequences on
the notation and the results of the previous two chapters, which are
summarized below.

Since the qualification of a query Q is a conjunction of equi-join
clauses, the join graph representing Q consists of only type-1 arcs
(see Section 2.2). Consequently, such a query Q can be represented by
an undirected join graph $JG_Q$ where each equi-join clause is
represented by an edge as opposed to two type-1 arcs in the
representation given in Section 2.2 (see Example 4.1). The transitive
closure $JG_Q^T$ of a join graph $JG_Q$, in this context, is obtained
by adding all possible edges to the join graph so that every
connected component becomes maximally connected. For a given query Q,
each connected component in the join graph $JG_Q$ consists of a
vertex equivalence class. Consequently, if there is an attribute of a
relation in a connected component in $JG_Q$ then the additional
attributes of the same relation can be eliminated from the connected
component by local processing (see Section 3.2). A reduced join graph
of a query Q in this context is a spanning forest of the transitive
closure $JG_Q^T$ of the join graph.
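The construction described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not
the algorithm of Chapter 3: attribute vertices are modelled as
(relation, attribute) pairs, the function names are hypothetical, and
the sample query (which contains one redundant clause on attribute a)
is made up for the illustration.

```python
from itertools import combinations

def components(clauses):
    """Connected components (vertex equivalence classes) of the join graph."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in clauses:
        parent[find(u)] = find(v)  # merge the classes of the two vertices
    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(g) for g in groups.values())

def transitive_closure(clauses):
    """All edges that make every connected component maximally connected."""
    return {frozenset(pair)
            for comp in components(clauses)
            for pair in combinations(comp, 2)}

def reduced_join_graph(clauses):
    """One spanning forest of the closure: a chain through each component."""
    return [(comp[i], comp[i + 1])
            for comp in components(clauses)
            for i in range(len(comp) - 1)]

# Hypothetical query with one redundant clause on attribute a.
clauses = [(("R1", "a"), ("R2", "a")),
           (("R2", "a"), ("R3", "a")),
           (("R1", "a"), ("R3", "a")),
           (("R2", "b"), ("R4", "b"))]
closure = transitive_closure(clauses)
forest = reduced_join_graph(clauses)
```

The chain is just one convenient choice of spanning forest; any other
spanning forest of the closure would serve equally well as a reduced
join graph.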


Example 4.1: Consider the following equi-join query
Q  =  ( . a  =  R2 ♦ a ) . and . ( . a  =  Rq . a ) . and . ( R-^  .  a  =  R 
. and . ( R^ . b  =  R^ .b ) . and . ( R^ . b  =  R^.b) 

The join graph $JG_Q$, its transitive closure $JG_Q^T$ and a reduced
join graph for Q are given below.

[Figure: join graph, transitive closure and reduced join graph of Q]

Let $Q^*$ and $Q_1$ be the queries represented by $JG_Q^T$ and the
reduced join graph, respectively. Then

Q  =  (R-^.a  =  Rq  » a )  .  and  .  (R^.a  =  R^  .  a  )  .  and  .  (  R^  .  a  =  R^  .  a  ) 

. a  nd .  ( R^ . b  =  R^ .b ) ♦ and .  (  R^  ♦  b  =  R^  .  b )  . and ♦  ( R2  .b  =  R^.b) 
Qj_=  (R-^.a  =  R2  •  a )  .and .  (  R^  .  a  =  R^.a) 

. and .  ( Rq . b  =  R^ . b ) ♦ and .  ( R  ^ . b  =  R^.b) 

The queries Q, $Q^*$ and $Q_1$ are equivalent queries since
$JG_Q^T = JG_{Q_1}^T$. #


There is a one-to-one correspondence between the joining attributes
of a relation and the connected components of the join graph. Hence
the attributes of relations referenced in Q can be renamed so that
the attributes which are the vertices of the same connected component
in the join graph have the same name (see Example 4.1). Any two
relations $R_i$, $R_j$ are said to have a common attribute if a
connected component contains attributes of both $R_i$ and $R_j$ (e.g.
in Example 4.1, $R_2$ and two other relations have attribute b in
common). Since two relations $R_i$, $R_j$ have at most one common
attribute, $R_i \rightarrow R_j$ is sufficient to denote the
semi-join without referencing the joining attributes.


The  query  graph  of  a  query  is  as  described  previously 
(Section  2.6).  The  following  example  illustrates  the  query 
graphs  of  three  equivalent  queries. 


Example 4.2: Consider the query Q in Example 4.1 and queries $Q_1$
and $Q^*$ that are equivalent to Q. The query graphs of Q, $Q_1$ and
$Q^*$ are given below.

[Figure: query graphs of Q, $Q_1$ and $Q^*$]


The following lemma shows that there is a one-to-one correspondence
between the edges of the join graph of a query Q and the edges of its
query graph.


Lemma 4.1: Let $JG_Q$ and QG be the join graph and the query graph of
a query Q respectively. Then

(a) There is a one-to-one correspondence between the edges of $JG_Q$
and the edges of QG.

(b) There is a one-to-one correspondence between the join clauses of
Q and the edges of QG.

Proof: (a) Since any two relations $R_i$, $R_j$ have at most one
attribute in common, no two edges of $JG_Q$ can map to the same edge
in the query graph QG.

(b) By definition, there is a one-to-one correspondence between the
join clauses of a query Q and the edges of its join graph $JG_Q$.
Then the result follows from (a). #

Since we consider only tree queries in this chapter, the following
notations, unless stated otherwise, will be used throughout the
chapter. Q is a tree query. $JG_Q^T$ is the transitive closure of the
join graph $JG_Q$ of Q. $Q^*$ is the query represented by $JG_Q^T$
and $QG^*$ is the query graph of $Q^*$. The following lemma states
that a query is a tree query iff any reduced join graph for the query
(i.e. any spanning forest of the transitive closure of its join
graph) has a tree query graph.

Lemma 4.2: Q is a tree query iff the query graph of a reduced join
graph is a tree.

Proof: If Q is a tree query then there is a spanning forest of
$JG_Q^T$ (i.e. a reduced join graph) having a tree query graph, by
definition. If a reduced join graph for Q has a tree query graph then
all the reduced join graphs must have tree query graphs since the
query graphs corresponding to spanning forests of $JG_Q^T$ have the
same number of edges (by Lemma 4.1(a)) and are connected (by
Lemma 2.2). #
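Lemma 4.2 suggests a simple membership test for the class of queries
considered in this chapter, sketched below in Python. The sketch is
an illustration only and is not the algorithm of Chapter 3: it
assumes equi-join clauses given as pairs of (relation, attribute)
vertices, at most one common attribute between any two relations, and
a hypothetical function name. The two sample queries are made up.

```python
def is_tree_query(clauses):
    """Build one reduced join graph, map it to its query graph, and
    check that the query graph is a tree (Lemma 4.2)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in clauses:
        parent[find(u)] = find(v)
    comps = {}  # component representative -> relations having that attribute
    for rel, attr in list(parent):
        comps.setdefault(find((rel, attr)), set()).add(rel)

    # Spanning forest (a chain per component) mapped to relation pairs.
    edges = set()
    for rels in comps.values():
        chain = sorted(rels)
        edges.update(zip(chain, chain[1:]))
    relations = set().union(*comps.values())

    # A graph is a tree iff it has |V| - 1 edges and is connected.
    if len(edges) != len(relations) - 1:
        return False
    adj = {r: set() for r in relations}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [next(iter(relations))]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(adj[r] - seen)
    return seen == relations

# Hypothetical queries: the first has a redundant clause but is a tree
# query; the second joins three relations in a cycle on a, b and c.
tree_q = [(("R1", "a"), ("R2", "a")), (("R2", "a"), ("R3", "a")),
          (("R1", "a"), ("R3", "a")), (("R2", "b"), ("R4", "b"))]
cyclic_q = [(("R1", "a"), ("R2", "a")), (("R2", "b"), ("R3", "b")),
            (("R3", "c"), ("R1", "c"))]
```

By the lemma the outcome does not depend on which spanning forest is
chosen, which is why the sketch may pick an arbitrary chain through
each component.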

Any path in $QG^*$ between two relations having an attribute a in
common contains only relations that have attribute a if no relation
appears more than once in the path. Furthermore, any spanning tree of
$QG^*$ is a tree query graph of a query equivalent to Q. These are
shown by the following lemma.


Lemma 4.3: For a tree query Q,

(a) If $P(R_i,R_j)$ is a path between $R_i$ and $R_j$ in $QG^*$ such
that no relation appears more than once in the path and $R_i$ and
$R_j$ have attribute a in common, then every relation in the path
$P(R_i,R_j)$ has attribute a.

(b) Any spanning tree of $QG^*$ is a tree query graph of an
equivalent query.

Proof: (a) Let $P(R_i,R_j)$ be a path in $QG^*$, where $R_i$ and
$R_j$ have attribute a, but a relation $R_k$ in the path does not
have attribute a. By Lemma 2.1(a), the edges in $P(R_i,R_j)$
correspond to a unique set of edges, E, in $JG_Q^T$. Since no
relation appears more than once in $P(R_i,R_j)$, there exists a
spanning forest JG of $JG_Q^T$ such that JG contains the edges in E
and possibly some other edges. Let QG be the query graph of JG. We
now show that there are at least two distinct paths between $R_i$ and
$R_j$ in QG. Clearly, QG contains the path $P(R_i,R_j)$ which passes
through $R_k$. Consider the unique path between $R_i.a$ and $R_j.a$
in the connected component of JG corresponding to the attribute a.
This path is mapped to a unique path between $R_i$ and $R_j$ in QG
which does not pass through $R_k$ since $R_k$ does not have attribute
a. Thus there are at least two paths between $R_i$ and $R_j$ in QG,
which contradicts the fact that QG is a tree (by Lemma 4.2).

(b) Let T be a spanning tree of $QG^*$. The edges of T correspond to
a unique set of edges, U, in $JG_Q^T$, by Lemma 4.1(a). It is
sufficient to show that U is a spanning forest in $JG_Q^T$ since any
spanning forest of $JG_Q^T$ is a join graph of a query equivalent to
Q. Consider any two relations $R_i$ and $R_j$ in a connected
component corresponding to an attribute, say a, in $JG_Q^T$. Since T
is a spanning tree, there is a unique path, p, between $R_i$ and
$R_j$ in T. By part (a), each relation in the path p has attribute a.
Thus, the path p in T between $R_i$ and $R_j$ corresponds to a path
between $R_i.a$ and $R_j.a$ in the connected component representing
attribute a in $JG_Q^T$. Since the above argument is true for every
pair of relations in a connected component in $JG_Q^T$, U covers at
least a spanning forest of $JG_Q^T$. #


Given a tree query Q with a target list, the attributes contained in
the target list are called output attributes and the relations with
the output attributes are called output relations. The site of the
network at which the answer of the query is required is called the
result node. A relation R is said to be fully reduced with respect to
Q if all the tuples of R which do not satisfy the qualification of Q
are eliminated. For a tree query Q, it has been shown [3] that any
relation R can be fully reduced with respect to Q by a sequence of
semi-joins traversing+ the tree query graph of an equivalent query
from the leaves to the root R in a breadth-first order.


+ A semi-join $R_i \rightarrow R_j$ traverses the edge $(R_i, R_j)$
of a query graph from $R_i$ to $R_j$.


Example 4.3: Consider the following query Q where

Q = $(R_1.a = R_2.a)$ .and. $(R_2.a = R_3.a)$ .and. $(R_2.b = R_4.b)$

The query graph QG of Q is

[Figure: query graph QG, with edges $(R_1,R_2)$, $(R_2,R_3)$ and
$(R_2,R_4)$]

A sequence of semi-joins
$z = \langle R_4 \rightarrow R_2,\ R_3 \rightarrow R_2,\ R_2 \rightarrow R_1 \rangle$
traverses QG from the leaves to the root $R_1$ in breadth-first
order. After the first two semi-joins $R_4 \rightarrow R_2$,
$R_3 \rightarrow R_2$ are executed, the tuples of $R_2$ which do not
satisfy $Q_{sub}$ = $(R_2.b = R_4.b)$ .and. $(R_2.a = R_3.a)$ are
eliminated. That is, $R_2$ is fully reduced with respect to
$Q_{sub}$, which is the subquery of Q corresponding to the subtree of
QG with root $R_2$. Similarly, after the last semi-join
$R_2 \rightarrow R_1$ is executed, $R_1$ is fully reduced with
respect to Q since the tuples of $R_2$ satisfy the subquery before
the semi-join $R_2 \rightarrow R_1$. #
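The traversal of Example 4.3 can be reproduced with the following
Python sketch. It is an illustration only: the function names are
hypothetical, relations are modelled as lists of dictionaries, and
the sample tuples are invented (the thesis gives no data for this
example). The schedule is obtained by reversing a breadth-first
search from the root, so that the deepest edges are traversed first.

```python
from collections import deque

def leaves_to_root(tree_edges, root):
    """Semi-joins (child, parent) ordered from the leaves to the root."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent, order, queue = {root: None}, [], deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                order.append((nxt, node))
                queue.append(nxt)
    return order[::-1]  # reverse breadth-first order: deepest edges first

def semi_join(source, target, attr):
    """Keep the tuples of target whose attr value also occurs in source."""
    values = {row[attr] for row in source}
    return [row for row in target if row[attr] in values]

def run(schedule, db, schema):
    """Execute the semi-joins in order on a copy of the database."""
    db = dict(db)
    for src, dst in schedule:
        (attr,) = schema[src] & schema[dst]  # the single common attribute
        db[dst] = semi_join(db[src], db[dst], attr)
    return db

# Query graph of Example 4.3; the tuples below are invented.
edges = [("R1", "R2"), ("R2", "R3"), ("R2", "R4")]
schema = {"R1": {"a"}, "R2": {"a", "b"}, "R3": {"a"}, "R4": {"b"}}
db = {"R1": [{"a": 1}, {"a": 2}, {"a": 3}],
      "R2": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 4, "b": 10}],
      "R3": [{"a": 1}, {"a": 2}, {"a": 4}],
      "R4": [{"b": 10}]}
z = leaves_to_root(edges, "R1")
reduced = run(z, db, schema)
```

With this data only the tuple with a = 1 of $R_1$ takes part in the
answer of Q, and it is the only tuple of $R_1$ left after the three
semi-joins.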


4.2 Problem Formulation


The problem of optimizing tree queries by semi-joins is that of
finding the optimum sequence of semi-joins fully reducing the output
relations with respect to the given query. After the output relations
are fully reduced, their projections over the output attributes can
be transmitted to the result node.

The cost of a semi-join $R_i \rightarrow R_j$, denoted by
$\mathrm{Cost}(R_i \rightarrow R_j)$, is the cost of data
transmission (from the site containing $R_i$ to that of $R_j$)
required to compute the semi-join, i.e.

$$\mathrm{Cost}(R_i \rightarrow R_j) = C_0 + C_1 \cdot w_a \cdot |R_i[a]|,$$

where a is the attribute common to $R_i$ and $R_j$, $|R_i[a]|$ is the
number of distinct values in the column a of relation $R_i$, and
$w_a$ is the average width of a data value in that column. $C_0$ and
$C_1$ are fixed constants. The size of $R_i[a]$ is the product of the
number of distinct values in the column of the relation $R_i$
corresponding to the attribute a and the width of that column. The
cost of a strategy, i.e. a sequence of semi-joins, is the sum of the
costs of the semi-joins employed in the strategy. (This cost
criterion is called total time cost in [13].) It turns out that we
need to estimate the number of distinct values in a column of a
relation after one of its columns is reduced by a semi-join. The
estimation can be done using the following assumptions.

Assumption 1: The values in a column of a relation are uniformly
distributed; and the values in two different columns in different
relations corresponding to a common attribute are independent. This
is the same assumption used in [13].

Assumption 2: If the number of distinct values in one column of a
relation is reduced, the number of distinct values in each of the
other columns of the same relation is reduced proportionally. In
[13], it is assumed that, after a column is reduced, the number of
distinct values in other columns of the same relation is unchanged.
Both Assumption 2 and the assumption in [13] may lead to some
inaccuracies.

In this chapter, Assumption 2 is used only to illustrate the
optimization procedure (to be presented), which is also valid for
other estimation methods. The estimation of the amount of data
transfer required by a sequence of semi-joins using Assumptions 1 and
2 is illustrated in the following example.

Example 4.4: Consider the query Q of Example 4.3, where

Q = $(R_1.a = R_2.a)$ .and. $(R_2.a = R_3.a)$ .and. $(R_2.b = R_4.b)$

and the sequence
$z_1 = \langle R_1 \rightarrow R_2,\ R_2 \rightarrow R_3 \rangle$.
Let A and B be the domains corresponding to attributes a and b
respectively. Let $w_a$ and $w_b$ be the width of a data value in the
domains A and B respectively. Then,

$$\mathrm{Cost}(z_1) = C_0 + C_1 \cdot w_a \cdot |R_1[a]| + C_0 + C_1 \cdot w_a \cdot |R_2'[a]|,$$

where $|R_2'[a]|$ is the expected cardinality of $R_2[a]$ after
$R_1 \rightarrow R_2$ is executed. Let $|A|$ be the cardinality of
domain A. By Assumption 1, the expected number of distinct values in
$R_2[a]$ satisfying $R_1.a = R_2.a$ is $p_{a1} \cdot |R_2[a]|$ where
$p_{a1} = |R_1[a]| / |A|$ is the probability that a given value in
domain A is in $R_1[a]$. Thus, $|R_2'[a]| = p_{a1} \cdot |R_2[a]|$.

Similarly, the estimated cardinality $|R_3'[a]|$ of $R_3[a]$ after
$z_1$ is executed is

$$|R_3'[a]| = p_{a1} \cdot p_{a2} \cdot |R_3[a]| = p_{a1} \cdot p_{a2} \cdot p_{a3} \cdot |A|.$$

Consider a sequence
$z_2 = \langle R_3 \rightarrow R_1,\ R_1 \rightarrow R_2,\ R_2 \rightarrow R_3 \rangle$.
The estimated cardinality $|R_3'[a]|$ of $R_3[a]$ after $z_2$ is
executed is also given by
$p_{a1} \cdot p_{a2} \cdot p_{a3} \cdot |A|$, since $R_3$ cannot
reduce itself.

Consider the sequence
$z_3 = \langle R_1 \rightarrow R_2,\ R_2 \rightarrow R_4 \rangle$.

$$\mathrm{Cost}(z_3) = 2C_0 + C_1 (w_a \cdot |R_1[a]| + w_b \cdot |R_2'[b]|),$$

where $|R_2'[b]|$ is the estimated cardinality of $R_2[b]$ after the
semi-join $R_1 \rightarrow R_2$ is executed. The cardinality of
$R_2[a]$ after $R_1 \rightarrow R_2$ is estimated to be
$p_{a1} \cdot |R_2[a]|$ (by Assumption 1) and then $|R_2'[b]|$ is
$p_{a1} \cdot |R_2[b]|$ (by Assumption 2). Thus, after $z_3$ is
executed, the number of distinct values in $R_4[b]$ becomes
$p_{a1} \cdot p_{b2} \cdot |R_4[b]|$, where
$p_{b2} = |R_2[b]| / |B|$. #
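The estimates of Example 4.4 can be checked numerically with the
short Python sketch below. It is an illustration under stated
assumptions rather than the optimization procedure of this chapter:
the function name, the constants $C_0 = 10$ and $C_1 = 1$, and all
cardinalities, widths and domain sizes are invented. The sketch
applies Assumptions 1 and 2 naively to every semi-join, so it does
not track which relations have already reduced a column; it therefore
reproduces the estimates for $z_1$ and $z_3$, but it would
over-reduce in a sequence such as $z_2$, where the values of $R_3$
return to $R_3$.

```python
def estimate(sequence, card, domain, width, c0=10.0, c1=1.0):
    """Return (total cost, estimated cardinalities) for a semi-join sequence.

    card[R][x] is the number of distinct values in column x of relation R.
    """
    card = {rel: dict(cols) for rel, cols in card.items()}  # work on a copy
    total = 0.0
    for src, dst in sequence:
        (attr,) = set(card[src]) & set(card[dst])  # single common attribute
        total += c0 + c1 * width[attr] * card[src][attr]
        p = card[src][attr] / domain[attr]  # Assumption 1: selectivity of src
        for col in card[dst]:
            card[dst][col] *= p  # Assumption 2: all columns shrink alike
    return total, card

# Invented statistics for the relations of Example 4.3.
domain = {"a": 100, "b": 50}
width = {"a": 4, "b": 8}
card = {"R1": {"a": 20}, "R2": {"a": 50, "b": 25},
        "R3": {"a": 40}, "R4": {"b": 10}}
cost1, card1 = estimate([("R1", "R2"), ("R2", "R3")], card, domain, width)
cost3, card3 = estimate([("R1", "R2"), ("R2", "R4")], card, domain, width)
```

With these numbers $p_{a1} = 0.2$, $p_{a2} = 0.5$, $p_{a3} = 0.4$ and
$p_{b2} = 0.5$, so the formulas of the example give
$|R_3'[a]| = 0.2 \cdot 0.5 \cdot 0.4 \cdot 100 = 4$ after $z_1$ and
$0.2 \cdot 0.5 \cdot 10 = 1$ distinct value in $R_4[b]$ after $z_3$,
which is what the sketch computes.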

In this chapter, we first consider tree queries with a single output
relation, and find the optimum sequence of semi-joins fully reducing
the output relation. In Section 4.4.3, the optimum sequence of
semi-joins fully reducing any one of the relations referenced by a
tree query will be given for queries in which every relation has
different output attributes.

Let $R_r$ be the output relation specified by the query Q. If $R_r$
resides at the result node then the optimum strategy fully reducing
$R_r$ at its site clearly answers Q with minimum cost. If the result
node of the query does not contain any one of the relations
referenced in Q then the optimum strategy answering Q consists of the
optimum sequence of semi-joins fully reducing $R_r$ at its site and
then transmitting the projection of $R_r$ over the output attributes
to the result node.


The above problem was studied by Hevner and Yao [13], but for a small
class of tree queries called simple queries. In a simple query, each
relation has exactly one attribute and that attribute is common to
all the relations in the query.


Let $S = \{R_1, \ldots, R_N\}$ be the set of relations referenced by
Q. A transmission strategy, denoted $T(R_r,S)$, is a sequence of
semi-joins fully reducing relation $R_r$ with respect to Q. A
sequence of semi-joins $R_1 \rightarrow R_2$, $R_2 \rightarrow R_3$,
..., $R_{k-1} \rightarrow R_k$ forms a transmission path
$R_1 \rightarrow \ldots \rightarrow R_k$ from $R_1$ to $R_k$ of
length k. A relation $R_j$ may occur in a transmission path more than
once. In order to differentiate between different occurrences of the
same relation, occurrences of relations in a transmission path are
represented by distinct nodes. For example,
$R_j \rightarrow R_k \rightarrow R_j$ is represented by
$N_j \rightarrow N_k \rightarrow N_t$ where nodes $N_j$, $N_k$
represent $R_j$ and $R_k$, respectively, and $N_t$ represents the
second occurrence of $R_j$ in the path. For notational convenience,
$N_j$ and $R_j$ will be used interchangeably when there is no
ambiguity. In a transmission path $N_j \rightarrow N_k$, $N_j$ is the
immediate predecessor of $N_k$ and $N_k$ is the immediate successor
of $N_j$. If the transmission path $P(N_j,N_k)$ from $N_j$ to $N_k$
is of length > 1 then $N_j$ is a predecessor of $N_k$, and $N_k$ is a
successor of $N_j$. Clearly, if an occurrence of $R_j$ is an
immediate predecessor of an occurrence of $R_k$, then $R_j$ and $R_k$
must have a common attribute. However, one of the predecessors of a
relation $R_k$ in a transmission path may not have an attribute
common to $R_k$. If a node $N_k$ has more than one immediate
predecessor, say $N_{j_1}, \ldots, N_{j_t}$, t > 1, all incoming
semi-joins, i.e. $N_{j_s} \rightarrow N_k$, $1 \le s \le t$, are
executed before any outgoing semi-joins of the form
$N_k \rightarrow N_t$.

The following two lemmas prove some useful properties of transmission
paths that will be utilized later.


Lemma 4.4: Let $N_1 \rightarrow N_2 \rightarrow N_3$ be a
transmission path where $N_i$ is an occurrence of $R_i$,
$1 \le i \le 3$, and $R_1$, $R_3$ have no attribute in common. Then
any transmission path $P(N_1,N_3)$ contains an occurrence of $R_2$.

Proof: Let $R_1$, $R_2$ have attribute a in common and $R_2$, $R_3$
have attribute b in common. Suppose there is a path $P(N_1,N_3)$
which does not contain any occurrence of $R_2$. Let $N_j$ be the last
occurrence of a relation with attribute a (see Fig. 4.1) in the path
$P(N_1,N_3)$, i.e. $N_j = N_1$ if there is no such occurrence other
than $N_1$. Then $R_j \ne R_2$ since $R_2$ does not occur in
$P(N_1,N_3)$. $R_j \ne R_3$ since $R_3$ does not have attribute a.
Since $N_j$ is the last occurrence of a relation with attribute a in
$P(N_1,N_3)$, the path from $R_2$ to $R_j$ via $R_3$ does not contain
any intermediate relation having attribute a. This contradicts
Lemma 4.3(a). #


Figure 4.1. Illustration for Lemma 4.4.


Figure 4.2. Illustration for Lemma 4.5.


Lemma 4.5: Let $R_1$ and $R_2$ be two relations having attribute a in
common and $P(N_1,N_2)$ be any transmission path where $N_1$ is the
only occurrence of $R_1$ in the path and $N_2$ is an occurrence of
$R_2$. Then $R_1$ and its immediate successor in $P(N_1,N_2)$ also
have attribute a in common.

Proof: Suppose the immediate successor $N_u$ of $N_1$ does not have
attribute a. Then there are two cases:

Case 1: $N_u$ and $R_2$ have no attribute in common. Since
$N_u \rightarrow N_1 \rightarrow N_2$ is a possible transmission
path, the subpath of $P(N_1,N_2)$ from $N_u$ to $N_2$ must contain an
occurrence of $R_1$ (by Lemma 4.4). This contradicts $N_1$ being the
only occurrence of $R_1$ in the path $P(N_1,N_2)$ via $N_u$.

Case 2: $N_u$ and $R_2$ have attribute c in common, $c \ne a$. Then
there is a path between $R_1$ and $R_2$ via $R_u$ where $R_u$ does
not have attribute a. This contradicts Lemma 4.3(a). #

A transmission strategy $T(R_r,S)$ is represented by a directed graph
whose vertices are the nodes representing occurrences of relations in
the strategy and whose arcs represent the semi-joins. For notational
convenience, we use $T(R_r,S)$ to denote both the transmission
strategy and its graph representation. Let $N_r$ be the last
occurrence of $R_r$ in $T(R_r,S)$, i.e. $N_r$ is not a predecessor of
any occurrence of $R_r$. It will be shown (Lemma 4.6) that there must
be a transmission path in $T(R_r,S)$ from an occurrence of every
relation $R_j$, $j \ne r$, in S to $N_r$. Moreover, the semi-joins in
$T(R_r,S)$ traverse a tree query graph QG of an equivalent query from
the leaves to the root $R_r$ such that the root of each subtree of QG
is fully reduced with respect to the subquery corresponding to that
subtree by a substrategy of $T(R_r,S)$ (Lemma 4.7). These are
illustrated by the following example.

Example 4.5: Consider the query Q in Example 4.4, where
$S = \{R_1(a),\ R_2(a,b),\ R_3(a),\ R_4(b)\}$.

Consider the two sequences of semi-joins below.

$z_1 = \langle R_2 \rightarrow R_4,\ R_3 \rightarrow R_2,\ R_4 \rightarrow R_2,\ R_2 \rightarrow R_1 \rangle$ and

$z_2 = \langle R_2 \rightarrow R_4,\ R_3 \rightarrow R_2,\ R_2 \rightarrow R_1,\ R_4 \rightarrow R_2 \rangle$.

$z_1$ is a sequence of semi-joins fully reducing $R_1$ with respect
to Q, i.e. $z_1 = T(R_1,S)$. The directed graph representing
$T(R_1,S)$ is

[Figure: directed graph representing $T(R_1,S)$]

where a separate node represents the second occurrence of $R_2$ in
the strategy. For each relation $R_j$, $2 \le j \le 4$, there is a
transmission path from an occurrence of $R_j$ to the last occurrence
of $R_1$, i.e. $N_1$, in $T(R_1,S)$. The tree query graph QG
traversed by $T(R_1,S)$ from the leaves to the root $R_1$ is as shown
below.

[Figure: tree query graph QG]

Note that the edge $(R_2,R_4)$ of QG is traversed twice in
$T(R_1,S)$. The subquery corresponding to the subtree of QG with root
$R_2$ is $Q_{sub}$ = $(R_4.b = R_2.b)$ .and. $(R_3.a = R_2.a)$. The
substrategy of $T(R_1,S)$ which fully reduces $R_2$ with respect to
the subquery $Q_{sub}$ is given below.

[Figure: substrategy of $T(R_1,S)$ fully reducing $R_2$]

Consider $z_2$. The transmission paths in $z_2$ are shown below

[Figure: transmission paths in $z_2$]

where two separate nodes represent the second and the third
occurrences of $R_2$. Note that there is no transmission path from an
occurrence of $R_4$ to the last occurrence of $R_1$ in $z_2$. Thus,
$z_2$ is not a transmission strategy fully reducing $R_1$ with
respect to Q. #
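The graph representation used in Example 4.5 can be sketched in
Python as follows. The sketch is an illustration only. It rests on
one assumption that is inferred from the rule that all incoming
semi-joins of a node are executed before its outgoing ones, and is
not stated in this form in the text: a new occurrence node is created
for a relation whenever it receives a semi-join after its current
occurrence has already sent one. Occurrence nodes are modelled as
(relation, occurrence number) pairs and the function names are
hypothetical.

```python
def occurrence_graph(sequence):
    """Arcs between occurrence nodes (relation, occurrence number)."""
    current, has_sent, arcs = {}, set(), []
    for src, dst in sequence:
        s = (src, current.setdefault(src, 1))
        if (dst, current.setdefault(dst, 1)) in has_sent:
            current[dst] += 1  # incoming after outgoing: new occurrence
        d = (dst, current[dst])
        arcs.append((s, d))
        has_sent.add(s)
    return arcs

def reaches(arcs, start_rel, goal):
    """Is there a transmission path from some occurrence of start_rel to goal?"""
    stack = [s for s, _ in arcs if s[0] == start_rel]
    stack += [d for _, d in arcs if d[0] == start_rel]
    seen = set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(d for s, d in arcs if s == node)
    return False

# The two sequences of Example 4.5.
z1 = [("R2", "R4"), ("R3", "R2"), ("R4", "R2"), ("R2", "R1")]
z2 = [("R2", "R4"), ("R3", "R2"), ("R2", "R1"), ("R4", "R2")]
g1, g2 = occurrence_graph(z1), occurrence_graph(z2)
last_r1 = ("R1", 1)  # R1 occurs once in both sequences
```

Under this assumption $R_2$ has two occurrences in $z_1$ and three in
$z_2$, and in $z_2$ no occurrence of $R_4$ reaches the last
occurrence of $R_1$, in agreement with the example.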


As discussed before, if a sequence of N-1 semi-joins z traverses a
tree query graph QG of an equivalent query from the leaves to the
root $R_i$ then z fully reduces $R_i$ with respect to Q [3]. Before
the root $R_i$ is fully reduced with respect to Q, the root of each
subtree of QG is fully reduced by z with respect to the subquery
corresponding to that subtree (see Example 4.3).

Lemma 4.6 below shows that any sequence of semi-joins fully reducing
a relation $R_i$ necessarily contains a subsequence+ z' with N-1
semi-joins such that z' traverses a tree query graph QG of an
equivalent query from the leaves to the root $R_i$, and the root of
each subtree of QG is fully reduced by z' with respect to the
subquery corresponding to that subtree.

+ A subsequence z' of a sequence of semi-joins z contains a subset of
the semi-joins in z such that their relative order of appearance in z
is preserved in z'. For example
$z' = \langle R_3 \rightarrow R_2,\ R_4 \rightarrow R_2,\ R_2 \rightarrow R_1 \rangle$
is a subsequence of $z_1$ in Example 4.5, however $z_2$ in
Example 4.5 is not a subsequence of $z_1$.

Let z be a sequence of semi-joins. Consider the following algorithm
to obtain a subsequence z' of z such that z' fully reduces $R_r$ with
respect to Q and has N-1 semi-joins if z fully reduces $R_r$ with
respect to Q.

Algorithm 4.1:

Input: A sequence $z = \langle SJ_1, \ldots, SJ_m \rangle$ of
semi-joins fully reducing $R_r$ with respect to Q.

Output: A subsequence z' of z such that z' fully reduces $R_r$ and
has exactly N-1 semi-joins.

(1) Let $S' = \{R_r\}$, z' = null sequence.

(2) Starting from the last semi-join in z, for each semi-join
$SJ_k = R_j \rightarrow R_t$ repeat

    begin delete $SJ_k$ from z;

        if $R_j \notin S'$ and $R_t \in S'$ then
        begin

            $S' := S' \cup \{R_j\}$;
            $z' := SJ_k \,|\, z'$; +

        end;

    end

until all the semi-joins in z are examined.

(3) Stop.
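Algorithm 4.1 translates almost line by line into Python. The sketch
below is an illustration only: semi-joins are modelled as (source,
target) pairs, the function name is hypothetical, and, unlike the
algorithm as stated, the input sequence z is scanned backwards rather
than destructively emptied. The two inputs are the sequences $z_1$
and $z_2$ of Example 4.5.

```python
def algorithm_4_1(z, root):
    """Return (z', S') for a sequence z of semi-joins (source, target)."""
    reached = {root}  # S'
    z_prime = []      # z'
    for src, dst in reversed(z):  # start from the last semi-join
        if src not in reached and dst in reached:
            reached.add(src)               # S' := S' U {R_j}
            z_prime.insert(0, (src, dst))  # z' := SJ_k | z'
    return z_prime, reached

# The two sequences of Example 4.5, with root R1.
z1 = [("R2", "R4"), ("R3", "R2"), ("R4", "R2"), ("R2", "R1")]
z2 = [("R2", "R4"), ("R3", "R2"), ("R2", "R1"), ("R4", "R2")]
sub1, s1 = algorithm_4_1(z1, "R1")
sub2, s2 = algorithm_4_1(z2, "R1")
```

For $z_1$ the result is the three semi-joins
$R_3 \rightarrow R_2$, $R_4 \rightarrow R_2$, $R_2 \rightarrow R_1$,
i.e. N-1 semi-joins for N = 4. For $z_2$ only two semi-joins are
kept and $R_4$ never enters S', which reflects the fact that $z_2$
does not fully reduce $R_1$.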

Since the number of semi-joins in z is finite, the algorithm
terminates. The following lemma proves the correctness of the
algorithm.

Lemma 4.6: Let $S = \{R_1, \ldots, R_N\}$ be the set of relations
referenced by a tree query Q, z be a sequence of semi-joins involving
relations in S and z' be the subsequence of z found by Algorithm 4.1.

(a) If $z = \langle SJ_1, \ldots, SJ_m \rangle$ is a transmission
strategy $T(R_r,S)$ then z' has exactly N-1 semi-joins and fully
reduces $R_r$ with respect to Q. Moreover, for every relation $R_j$,
$j \ne r$, in S, z' contains a transmission path from $R_j$ to $R_r$.

(b) If z contains a transmission path from every relation $R_j$,
$j \ne r$, in S to $R_r$ then z' also contains a transmission path
from every relation $R_j$ to $R_r$.

+ $SJ_k \,|\, z'$ denotes the concatenation of $SJ_k$ with z'.

Proof ;  (a)  Let  zR,  z'k,  and  S'k  denote  z,  z1  and  the  set  S' 

respectively  before  step  (2)  of  algorithm  4.1  is  executed 
for  the  semi- join  SJR  in  z,  l<k<m.  By  backv/ard  induction  on 
k  we  will  show  that  at  each  stage  k  of  the  algorithm,  z^lz'^ 
fully  reduces  R^,  z'R  has  exactly  |S'kl-l  semi-joins  and  for 
each  relation  R^  in  S'k,  jpr,  z'v  has  a  transmission  path 


k 


from  Rj  to  Rr . 


Basis ;  k=m,  trivially  true. 

Induction  step:  Suppose  it  is  true  before  SJR  in  z  is 

examined,  l<k<m.  We  will  show  that  it  is  also  true  after  SJ 

is  examined.  Let  SJ,  in  z  be  R  • — >R.  .  There  are  two  cases: 

k  3  t 

Case  1:  R.0S',  and  R.6S'.  .  Then  R.-->R,  is  deleted  from 
- —  y  k  t  k  j  t 

z,  and  concatenated  to  z‘,  .  That  is  after  SJ,  is  examined 
k  k  k 

zk_x  =  <  SJ1 ,  .  .  .  ,SJk_1>  and  z  'k_1=  SJklz'k.  Since  zklz'R  is 
the  same  as  zk_1lz'k_1,  by  induction  hypothesis,  Zk-1 1 Z 'k-l 
also  fully  reduces  R^  with  respect  to  Q.  s'k_1=  S'k  U  {Rj}. 
By  induction  hypothesis,  z'R_1  contains  a  transmission  path 
from  each  relation  Rg  e  S'k_i,  s^r,  to  Rr-  Since  the  number 
of  serai-joins  in  z'R  and  the  size  of  S'R  are  both  increased 


k 


by  1  after  the  semi-join  SJR  is  examined,  z'k_1#  z‘y- 1 
k-l 


contains  |S'1,_1|-1  semi-]oins  (by  induction  hypothesis). 


Case 2: R_j ∈ S'_k or R_t ∉ S'_k. Then R_j-->R_t is removed from z_k, but z'_{k-1} and S'_{k-1} are the same as z'_k and S'_k respectively. Since after the semi-join SJ_k is examined only z_k is changed, it suffices to show that z_{k-1}|z'_{k-1} fully reduces R_r with respect to Q. There are three subcases.

(i) R_j ∈ S'_k and R_t ∈ S'_k. By induction hypothesis, there is a transmission path in z'_k = z'_{k-1} from R_j to R_r. Similarly, there is a transmission path in z'_k from R_t to R_r. Hence, P(R_j,R_t) = R_j-->...-->R_r-->...-->R_t is a possible transmission path in z'_{k-1}. Since R_j-->R_t is a semi-join, R_j and R_t have an attribute in common, say a. Then R_j[a] is transmitted from R_j to R_r in the subpath from R_j to R_r of the path P(R_j,R_t) (by Lemma 4.5), without employing the semi-join SJ_k. Since transmitting R_j[a] to R_r more than once does not cause any further reduction on R_r, if z_k|z'_k fully reduces R_r with respect to Q, so does z_{k-1}|z'_k after the semi-join R_j-->R_t is removed from z_k.

(ii) R_j ∈ S'_k and R_t ∉ S'_k. R_t is not transmitted to R_r after the semi-join R_j-->R_t in z_k|z'_k. Hence, the semi-join SJ_k is not used in z_k|z'_k in fully reducing R_r. Therefore, z_{k-1}|z'_k also fully reduces R_r with respect to Q.

(iii) R_j ∉ S'_k and R_t ∉ S'_k. Then neither R_t nor R_j is transmitted to R_r in z_k|z'_k after the semi-join R_j-->R_t is performed. Since z_k|z'_k fully reduces R_r by induction hypothesis, z_{k-1}|z'_k also fully reduces R_r.


When the algorithm terminates, there are no semi-joins left in z, i.e. z_0 is a null sequence and z_0|z'_0 = z'_0 = z'. Hence z' fully reduces R_r with respect to Q. Since z' contains exactly |S'_0|-1 semi-joins and fully reduces R_r with respect to Q, S'_0 = S. Thus z' contains N-1 semi-joins and z' contains a transmission path from each relation R_j in S, j ≠ r, to R_r.

(b) The proof is straightforward using induction on the number of semi-joins examined, similar to part (a). #
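
The backward scan analysed in the proof of part (a) can be sketched in Python. This is only an illustration under stated assumptions: it assumes that Algorithm 4.1 starts with S' = {R_r} and an empty z', and that it treats each semi-join exactly as Cases 1 and 2 describe. The function name prune_semijoins and the tuple encoding of semi-joins are hypothetical.

```python
def prune_semijoins(z, result):
    """Scan z backwards, keeping R_j-->R_t only when R_j is new and R_t is reached.

    z      : list of (source, target) relation names, in execution order
    result : name of the relation R_r to be fully reduced
    Returns (z_prime, s_prime): the kept semi-joins in execution order and
    the set of relations that have a transmission path to the result.
    """
    s_prime = {result}   # assumed initial S': the result relation only
    z_prime = []         # assumed initial z': the null sequence
    for source, target in reversed(z):
        if source not in s_prime and target in s_prime:
            # Case 1: move the semi-join to the front of z'
            z_prime.insert(0, (source, target))
            s_prime.add(source)
        # Case 2: otherwise the semi-join is dropped
    return z_prime, s_prime


z = [("R2", "R3"), ("R4", "R3"), ("R3", "R1"), ("R2", "R1")]
z_prime, s_prime = prune_semijoins(z, "R1")
```

In this small example the first semi-join R2-->R3 is dropped because R2 already reaches R1, and the kept sequence has |S'|-1 semi-joins, as the lemma states.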

Lemma 4.7: Let Q be a query with N relations and z' be a sequence of N-1 semi-joins. If z' contains a transmission path from each relation R_j, j ≠ r, to R_r then there exists a tree query graph QG of an equivalent query traversed by z' such that the root of each subtree of QG is fully reduced with respect to the subquery corresponding to that subtree by z'.

Proof: Since QG* contains an edge between any two relations having a common attribute, if a semi-join R_j-->R_k is in z' then the edge (R_j,R_k) is in QG*. Since there is a transmission path from each relation R_j ∈ S, j ≠ r, to R_r in z', there is a one-to-one correspondence between the semi-joins in z' and the edges of a spanning tree of QG*. By Lemma 4.3(b), any spanning tree of QG* is a tree query graph of an equivalent query. Hence there is a one-to-one correspondence between the semi-joins in z' and the edges of a tree query graph QG of an equivalent query. For each relation R_j, j ≠ r, there is exactly one transmission path from R_j to R_r in z' and the semi-joins are executed in the order of their appearance in the path. Thus z' traverses QG from the leaves to the root R_r such that the root of each subtree of QG is fully reduced with respect to the subquery corresponding to that subtree. #


The following proposition gives the necessary condition to be satisfied by a sequence of semi-joins fully reducing a relation with respect to Q.

Proposition 4.1: Let T(R_r,S) be a transmission strategy fully reducing a relation R_r with respect to a tree query Q, where S is the set of relations referenced by Q. Then

(a) For every relation R_j in S, j ≠ r, there is a transmission path from N_j to N_r in T(R_r,S), where N_j is an occurrence of R_j and N_r is the last occurrence of R_r in T(R_r,S).

(b) There is a tree query graph QG of an equivalent query such that T(R_r,S) traverses QG from the leaves to the root R_r, and the root of each subtree of QG is fully reduced with respect to the subquery corresponding to that subtree by a substrategy of T(R_r,S).

Proof: Directly follows from Lemmas 4.6 and 4.7. #


Let Q be a tree query and S = {R_1,...,R_N} be the set of relations referenced by Q. Consider the attributes of a relation R_i ∈ S that are named in the qualification of Q. Each such attribute of R_i must be common+ to at least one other relation in S since the qualification of Q is a conjunction of join clauses. Thus, the set S is sufficient to represent Q* since the join clause (R_i.a = R_j.a) is in Q* iff relations R_i and R_j are in S and have attribute a in common. Clearly, equivalent queries have the same set S of relations. Hence, with no loss of generality, we can use the term "fully reduced over the set S" to mean "fully reduced with respect to the query Q".
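
The way the set S determines Q* can be made concrete with a short Python sketch. It assumes, as the text does, that common attributes carry the same name; the function name closure_clauses and the dictionary encoding of S are hypothetical.

```python
from itertools import combinations


def closure_clauses(relations):
    """List the join clauses (R_i.a = R_j.a) implied by a set S of relations.

    relations : dict mapping a relation name to its set of attribute names
    Returns a list of (R_i, R_j, attribute) triples, one per shared attribute.
    """
    clauses = []
    for r1, r2 in combinations(sorted(relations), 2):
        for attr in sorted(relations[r1] & relations[r2]):
            clauses.append((r1, r2, attr))
    return clauses


S = {"R1": {"a"}, "R2": {"a"}, "R3": {"a", "b"}, "R4": {"b"}}
clauses = closure_clauses(S)
```

For this S the sketch yields four clauses, three on attribute a and one on attribute b, so two queries over the same S always produce the same clause list.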


Two  transmission  strategies  are  equivalent  if  both 
strategies  fully  reduce  the  same  relation  over  the  same  set, 
S,  irrespective  of  the  contents  of  the  relations  in  S. 
Clearly,  the  optimum  strategy  fully  reducing  a  relation  over 
S  is  the  one  with  minimum  cost  in  the  set  of  all  equivalent 
strategies. 


4.3 Potentially Optimum Strategies

A transmission strategy T(R_r,S) fully reducing a relation over a set S is redundant if there is a lower cost equivalent strategy T'(R_r,S), i.e. Cost(T'(R_r,S)) ≤ Cost(T(R_r,S)) irrespective of the contents of the relations in S and there is at least one database state for relations in S for which Cost(T'(R_r,S)) < Cost(T(R_r,S)). In this section, we indicate the configurations of potentially optimum strategies which are obtained by eliminating the redundant strategies from the set of all equivalent strategies.

+The common attributes have the same name, and any two relations have at most one attribute in common (see Section 4.1).

Example 4.6: Let S = {R_1(a), R_2(a), R_3(a,b), R_4(b)} be the relations referenced by a query Q. Consider the two strategies T_1(R_1,S) and T_2(R_1,S),


where  N'^  represents  the  second  occurrence  of  R^  in  both 
strategies.  Cos  t(  ^  )  =  Cos  t(  )  .  However,  the 

cost  of  N~ — >N2 — ^^'3  ls  lower  in  T2(R^/S)  than  in  T^(R^,S) 
since  Rh  and  N2  are  reduced  by  in  T2(R^#S)  but  not  in 
T_1(R_1,S). Consequently, Cost(T_2(R_1,S)) ≤ Cost(T_1(R_1,S)) irrespective of the contents of the relations in S. Hence, T_1(R_1,S) is a redundant strategy. #

Below  a  set  of  properties  that  must  be  satisfied  by  a 
potentially  optimum  strategy  is  presented.  These  properties 
allow  us  to  discard  redundant  strategies  and  indicate  what  a 
potentially  optimum  strategy  looks  like. 


Property 1: If N_i-->N_j is a semi-join in T(R_r,S) then N_i and N_j are occurrences of different relations.

Property 2: If P_1(N_1,N) and P_2(N_2,N) are two nonoverlapping+ transmission paths in T(R_r,S) intersecting at N then the relations represented by N_1 and N_2 do not have a common attribute.

Property 3: The outdegree of every node in T(R_r,S) except the result node, N_r, representing the last occurrence of R_r is one and the outdegree of the result node is zero.
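
Properties 1 and 3 are purely structural, so they can be checked mechanically on a strategy graph. The following Python sketch assumes a strategy is given as a mapping from nodes to the relations they represent plus a list of arcs; this encoding and the function name check_properties_1_and_3 are hypothetical. Property 2 is not covered because it also needs the attribute sets.

```python
from collections import Counter


def check_properties_1_and_3(node_relation, arcs, result_node):
    """Return (property_1_holds, property_3_holds) for a strategy graph.

    node_relation : dict mapping each node to the relation it represents
    arcs          : list of (source_node, target_node) semi-joins
    result_node   : the node representing the last occurrence of R_r
    """
    # Property 1: both ends of a semi-join represent different relations.
    prop1 = all(node_relation[u] != node_relation[v] for u, v in arcs)
    # Property 3: outdegree is 0 for the result node and 1 for every other node.
    outdegree = Counter(u for u, _ in arcs)
    prop3 = all(
        outdegree[n] == (0 if n == result_node else 1) for n in node_relation
    )
    return prop1, prop3


nodes = {"N4": "R4", "N3": "R3", "N2": "R2", "N1": "R1"}
chain = [("N4", "N3"), ("N3", "N2"), ("N2", "N1")]
ok = check_properties_1_and_3(nodes, chain, "N1")
bad = check_properties_1_and_3(nodes, chain + [("N3", "N1")], "N1")
```

The second call adds an extra arc out of N3, which gives that node outdegree two and so violates Property 3.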

The first property is obvious since otherwise T(R_r,S) contains a redundant semi-join, i.e. removing N_i-->N_j results in a lower cost equivalent strategy. Property 2 implies that if any two relations R_i, R_j have an attribute in common then all occurrences of R_i and R_j must be in a single transmission path in T(R_r,S). Let N_r be the result node of T(R_r,S). There is a directed path in T(R_r,S) from every node N_i to N_r (by Prop. 4.1). Then, property 3 implies that T(R_r,S) is an intree [12] with root N_r. Let N_1,...,N_t, t ≥ 1, be the immediate predecessors of N_r in T(R_r,S) and S_i, 1 ≤ i ≤ t, be the set of relations represented by the nodes in the subtree with root N_i (see Figure 4.3). Each set S_i, 1 ≤ i ≤ t, represents a subquery of Q* where Q* is the query represented by the set S. Let Q*_i be the subquery of Q*

+Two transmission paths are nonoverlapping if there is no arc contained in both paths.


represented by the set S_i, 1 ≤ i ≤ t. Since Q* is a tree query, Q*_i is also a tree query. Since the subtree contains a transmission path from every relation in S_i to the last occurrence, N_i, of R_i, the subtree of T(R_r,S) with root N_i is a substrategy fully reducing R_i with respect to Q*_i (by Lemma 4.6(b) and Lemma 4.7). This is illustrated in Example 4.7 below.

Figure 4.3. A strategy T(R_r,S) and its substrategies T(R_i,S_i), i=1,2,...,t.


Example 4.7: Consider the set S = {R_1(a), R_2(a), R_3(a,b), R_4(b)} and the strategy T_1(R_1,S) given in Example 4.6. The query Q* represented by the set S is

Q* = (R_1.a = R_2.a) .and. (R_1.a = R_3.a) .and. (R_2.a = R_3.a) .and. (R_3.b = R_4.b).

The subtree of T_1(R_1,S) with root N'_3 is over the set S_3 = {R_2(a), R_3(a,b), R_4(b)}.

The subquery Q*_3 of Q* which is represented by S_3 is

Q*_3 = (R_2.a = R_3.a) .and. (R_3.b = R_4.b)

The substrategy T(R_3,S_3) and the tree query graph traversed by the substrategy are shown below.

Properties 2 and 3 are proved by the following two lemmas.

Lemma 4.8: A potentially optimum strategy must satisfy property 2.

Proof: Let P_1(N_1,N) and P_2(N_2,N) be two nonoverlapping paths intersecting at N such that N_1, N_2 are occurrences of R_1, R_2 which have attribute a in common. Let N_3 and N_4 be the closest nodes to N in P_1(N_1,N) and P_2(N_2,N), respectively, that represent relations with attribute a. It suffices to show that there is a lower cost equivalent strategy having N_3 and N_4 as predecessors of N in a single path. There are two cases.

Case 1: N is the immediate successor of both N_3 and N_4 (see Fig. 4.4-(a)). Since N_3 and N_4 have attribute a in common, and N_3-->N-->N_4 is a possible transmission path, N_3 and N also have attribute a in common (by Lemma 4.5). Since any two relations have at most one common attribute, the data transmitted from N_3 to N by N_3-->N must be N_3[a]. Then replacing N_3-->N by N_3-->N_4 results in a lower cost equivalent strategy since Cost(N_3-->N_4) = Cost(N_3-->N) while Cost(N_4-->N) is lower in the new strategy because N_4 is reduced by the relations in the path from N_1 to N_3 as well.
Case  2:  N  is  not  the  immediate  successor  of  both  and 
(see  Fig.  4.4-(b)).  Without  loss  of  generality,  let 
has  as  its  immediate  successor  in  P(N-^,N)  which 
represents  .  Consider  the  path 

N-,  — — >  •  •  • — >N — >  .  .  .  — >N.,  .  Such  a  path  is  possible  since 
3  5  4- 

N . — >N ■  permits  N.-->N..  If  LL  is  the  only  occurrence  of  R^ 
l  j  ^  J  i  5  ^ 

in  the  path  then  R^  must  have  attribute  a  (by  Lemma  4.5). 


Figure 4.4. Illustration for Lemma 4.8: (a) Case 1; (b) Case 2.
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This  will  contradict  with  ft^  oeing  the  closest  nooe  to  ft 

having  attribute  a.  Tnus,  occurs  more  than  once  in  the 

path.  Since  ft„  ana  ft.  are  the  closest  nodes  to  LSI  in  the  path 

having  attribute  a,  l\  must  be  an  occurrence  of  Rn  .  L^t  h,  be 

5  6 

the  immediate  predecessor  of  ft  in  P  (N2,ft).  Since  isi ^  and  ft 

are  the  occurrences  of  the  same  relation,  i\i  ,--->ft  can  be 

b 

replaced  by  ^--^ft^  resulting  in  an  equivalent  strategy  in 
which  w ^  and  ft^  are  predecessors  of  ft  in  tne  same  path.  The 
new  strategy  has  a  lower  cost  since  Cost(i\i^ — >ft) 

=  Cost(jSi^ — ^^3)  while  the  cost  of  the  path  from  ft ^  to  ft  in 
the  new  strategy  is  lower.  # 

Lemma  4.9:  A  potentially  optimum  strategy  must  satisfy 
property  3. 

Proof: Let N_r be the result node of a strategy and N_i be any other node such that outdegree(N_i)=2. It is sufficient to show that this strategy can not be optimal. There are two transmission paths from N_i to N_r. Let N be the first node where the two paths P_1(N_i,N_r) and P_2(N_i,N_r) intersect (N may be N_r). There are two cases:

Case 1: The immediate successors of N_i in both paths are the same node, i.e. N (see Fig. 4.5-(a)). Then one of the semi-joins N_i-->N is redundant and can be removed, resulting in a lower cost equivalent strategy.

Case 2: The immediate successors of N_i are two different nodes, N_1 and N_2. By Property 2, N_1 and N_2 do not have any attribute in common. Since N_1-->N_i-->N_2 is a possible transmission path, P(N_1,N_2) = N_1-->...-->N-->...-->N_2 must contain an occurrence of R_i (by Lemma 4.4). If the occurrence of R_i in P(N_1,N_2) is between N_1 and N then property 2 is violated since R_i and N_2 have a common attribute. Similarly, the occurrence of R_i in P(N_1,N_2) can not be between N and N_2. Thus, N must be an occurrence of R_i. Since N_i and N are the occurrences of the same relation, an equivalent strategy can be obtained by replacing N with two occurrences N' and N" of R_i and replacing N_i-->N_2 and the path from N_2 to N by N'-->N_2 and the path from N_2 to N" respectively (see Fig. 4.5-(b)). Since Cost(N_i-->N_2) ≥ Cost(N'-->N_2) and the nodes along the path from N_2 to N" are reduced by additional relations, the resulting strategy has a lower cost. #

Figure 4.5. Illustration for Lemma 4.9: (a) Case 1; (b) Case 2.

The following property states the conditions to be satisfied by substrategies of a potentially optimum strategy T(R_r,S).

Property 4: Let N_1,...,N_t, t ≥ 1, be the immediate predecessors of N_r in T(R_r,S), and S_i be the set of relations represented by the nodes of the subtree with root N_i, 1 ≤ i ≤ t. Then

(a) Indegree(N_r) ≤ the number of attributes of R_r in S,

(b) If indegree(N_r)=t, t ≥ 1, then the sets S_1,...,S_t are disjoint subsets of S-{R_r} such that S_1 ∪ ... ∪ S_t = S-{R_r},

(c) If T(R_i,S_i) is a substrategy of T(R_r,S) then T(R_i,S_i) must be a potentially optimum strategy fully reducing R_i over S_i.

Lemma 4.10: A potentially optimum strategy T(R_r,S) must satisfy property 4.

Proof: (a) Suppose indegree(N_r) is greater than the number of attributes of R_r in S. Since each immediate predecessor shares an attribute with R_r, there are at least two immediate predecessors N_i, N_j, i ≠ j, of N_r in T(R_r,S) such that N_i and N_j have a common attribute. This violates property 2 since N_i and N_j are in two nonoverlapping paths, N_i-->N_r and N_j-->N_r.

(b) If the sets S_1,...,S_t, t ≥ 1, are not disjoint then there is a relation R_k contained in two or more sets, say S_i, S_j, i ≠ j, 1 ≤ i,j ≤ t. Let N_k and N'_k be the occurrences of R_k in S_i and S_j respectively. Since the path from N_k to N_r and the path from N'_k to N_r are two nonoverlapping paths, property 2 is violated. Similarly, R_r ∉ S_i, 1 ≤ i ≤ t, by property 2. Since R_r is fully reduced over S, S_1 ∪ ... ∪ S_t = S-{R_r} (by Prop. 4.1).

(c) If a substrategy T(R_i,S_i) of T(R_r,S) is not potentially optimal then replacing T(R_i,S_i) with a lower cost equivalent strategy results in a lower cost strategy equivalent to T(R_r,S); a contradiction. #

At this point we want to differentiate a single attribute relation from a multiattribute relation because they satisfy different properties. A relation R_i is a multiattribute relation in S if R_i has more than one joining attribute in S; otherwise it is called a single attribute relation in S. Let T(R_i,S_i) be a substrategy of T(R_r,S). It is possible that a multiattribute relation in S may become a single attribute relation in S_i as illustrated in the following example.

Example 4.8: Consider the strategy T(R_3,S_3) in Example 4.7, where S_3 = {R_2(a), R_3(a,b), R_4(b)}. The substrategy of T(R_3,S_3) with result node N_2 is over the set S_2, where S_2 = {R_2(a), R_3(a,b)}.

R_3 has only one joining attribute, namely a, in S_2 since no other relation in S_2 has attribute b. That is, R_3 which is a multiattribute relation in S_3 becomes a single attribute relation in S_2. #
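
The distinction between single attribute and multiattribute relations depends on the set in which a relation is viewed, which a few lines of Python make explicit. This is a sketch only; the function name joining_attributes and the dictionary encoding are hypothetical, and the sets are the ones read from Examples 4.7 and 4.8.

```python
def joining_attributes(name, relations):
    """Return the attributes of `name` shared with some other relation in the set.

    relations : dict mapping a relation name to its set of attribute names
    """
    others = set()
    for other, attrs in relations.items():
        if other != name:
            others |= attrs
    return relations[name] & others


S3 = {"R2": {"a"}, "R3": {"a", "b"}, "R4": {"b"}}
S2 = {"R2": {"a"}, "R3": {"a", "b"}}

in_s3 = joining_attributes("R3", S3)   # two joining attributes: multiattribute
in_s2 = joining_attributes("R3", S2)   # one joining attribute: single attribute
```

A relation is multiattribute in a set exactly when this function returns more than one attribute for it.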


By property 4, indegree(N_r) is bounded by the number of attributes of R_r in S. If R_r is a multiattribute relation in S then N_r may have one or more immediate predecessors. Property 5, below, describes the substrategies of a potentially optimum strategy T(R_r,S) when the result node has more than one immediate predecessor.

Property 5: If R_r has t attributes in S and indegree(N_r) > 1 in T(R_r,S) then

(a) Indegree(N_r)=t

(b) The partition S_1,...,S_t of S-{R_r} is unique, where S_i is the set corresponding to the substrategy of T(R_r,S) with result node N_i which is an immediate predecessor of N_r, 1 ≤ i ≤ t.

Lemma 4.11: A potentially optimum strategy T(R_r,S) satisfies property 5.

Proof: (a) Let indegree(N_r) = k > 1. Suppose k < t. Since any two relations can have at most one attribute in common, there is at least one attribute of R_r, say a, which is not common to any immediate predecessor N_i of N_r, 1 ≤ i ≤ k. Since a is a joining attribute, there is at least one relation R_u, u ≠ r, (see Figure 4.6) in a subtree with root N_j, 1 ≤ j ≤ k, such that R_u has attribute a. Since T(R_j,S_j) is a substrategy with result node N_j and R_u ∈ S_j, there is a transmission path P(N_u,N_j) in T(R_j,S_j) (by Prop. 4.1). If R_u and R_j have no common attribute then P(N_u,N_j) must contain an occurrence of R_r (by Lemma 4.4) since N_u-->N_r-->N_j is a possible transmission path. But this contradicts with R_r ∉ S_j (by property 4-(b)). Thus R_u and R_j must have an attribute in common. Since R_u and R_r have attribute a in common and N_u-->N_j-->N_r is a possible transmission path, R_j must have attribute a (by Lemma 4.5). This is a contradiction since R_j does not have attribute a. Thus indegree(N_r) can not be less than t. Then indegree(N_r)=t since it can not be greater than t (by property 4-(a)).

(b) Let {N_1,...,N_t} and {N'_1,...,N'_c} be the sets of immediate predecessors of N_r in two different partitions {S_1,...,S_t} and {S'_1,...,S'_c} of S-{R_r} respectively (see Fig. 4.6-(b)). By proof of part (a), c = t and each N'_i has a distinct attribute a_i in common with N_r, 1 ≤ i ≤ t. By rearrangement of indices, if necessary, N_i and N'_i have attribute a_i in common, 1 ≤ i ≤ t. By property 2, N_i and N_k, i ≠ k, 1 ≤ i,k ≤ t, have no attribute in common. Since N_i-->N_r-->N_k is a possible transmission path, any transmission path from R_i to R_k must contain an occurrence of R_r (by Lemma 4.4). We now show that this will be contradicted if S_i ≠ S'_i for some 1 ≤ i ≤ t. The inequality of two sets implies that there is a relation R_u ∈ S_i such that R_u ∉ S'_i. Since S_1 ∪ ... ∪ S_t = S'_1 ∪ ... ∪ S'_t = S-{R_r}, there is a set S'_k such that R_u ∈ S'_k for some k ≠ i, 1 ≤ k ≤ t. Let N_u and N'_u be the occurrences of R_u, and P_1(N_u,N_i) and P_2(N'_u,N'_k) be the two transmission paths in S_i and S'_k respectively. Then the path from N_i to N_u (the reverse of P_1(N_u,N_i)) joined with N_u-->N'_u, P_2(N'_u,N'_k), N'_k-->N_k forms a possible transmission path from R_i to R_k. This path does not contain R_r since neither S_i nor S'_k contains R_r; a contradiction. #

Figure 4.6. Illustration for Lemma 4.11: (a) Case (a); (b) Case (b).


By property 4-(a), if R_r has t attributes in S, then indegree(N_r) ≤ t. By property 5(a), if indegree(N_r) > 1 then it must be t. If indegree(N_r)=t in T(R_r,S) then S-{R_r} is partitioned into t disjoint subsets S_1,...,S_t each representing a subquery of Q* and the partition is unique, by properties 4(b) and 5(b). The following property helps to identify the immediate predecessor of N_r in T(R_r,S) when R_r is a multiattribute relation in S but indegree(N_r)=1.

Property 6: Let T(R_r,S) be a substrategy of T(R_p,S_p) such that R_r is a multiattribute relation in S and R_p is the immediate successor of R_r. If indegree(N_r)=1 and R_i is the immediate predecessor of R_r then

(a) the substrategy T(R_i,S_i) satisfies S_i=S,

(b) the attribute a_pr that is common to R_p and R_r is not the same as the attribute a_ri that is common to R_r and R_i.


Lemma 4.12: A potentially optimum strategy T(R_p,S_p) satisfies property 6.


Proof :  (a)  Since  R^  is  a  multiattribute  relation  in  S,  there 

is  at  least  one  attribute  a  ,  .  a  ,  ^  a  . ,  of  R  such  that 

rk  rk  n  r 

a^  is  also  an  attribute  of  a  relation  R^.,  k^i ,  in  S  and 

T(R  ,S)  contains  an  occurrence  N,  of  R,  .  Since  a  .  is  an 
r'  k  k  n 

attribute  of  R.,  and  R  and  R.  can  have  at  most  one 

l  r  l 

attribute  in  common,  R  does  not  have  a  .  Thus  R,  ^R.  . 

1  IT  K  K  1 

Similarly,  R  does  not  have  attribute  a  .  of  R  because 

KL  IT  1  IT 

otherwise  R  and  R,  would  have  two  attributes  in  common, 
r  k 


Then  the  occurrence  of  R^  must  be  in  the  substrategy 
T ( R^ , S 1 )  with  result  node  since  indegree ( Nr ) =1 .  Suppose 


R.  and  R,  have  an  attribute,  say  b,  in  common.  Since  any  two 
1  K 

relations  have  at  most  one  attribute  in  common,  fr^a^  an<3 
b^a^  .  Since  is  a  possible  transmission  path, 


r  l 


R  and  R.  must  have  attribute  b  in  common  (by  lemma  4.5) 
k  l 


. 
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This  is  a  contradiction  since  b^a  ,  .  Thus  R.  and  R  nave  no 

r  k  i  K 

attribute  in  common.  Since  T(R^,S-  is  a  strategy,  there  is  a 

transmission  patn  P(Nk,Ni)  in  T(Ri,Si)  (by  prop.  4.1)  .  By 

Lemma  4.4,  P(d.  ,Fi.)  contains  an  occurrence  of  N '  of  R, 

i\  1.  r  r 

since  NK_-^dr  ~~>Fi^  is  a  possible  transmission  patn. 

Therefore,  Si  contains  Rr ,  i.e.  Si=S. 

(b)  From  part(a),  there  is  at  least  one  occurrence  d  1  r  of 

Rr  in  T(R^,S).  Consider  the  transmission  path  P(d'r,d^)  in 

tne  substrategy  T(R^,S).  with  no  loss  of  generality,  let  d 1 

oe ’the  last  occurrence  of  Rf  in  the  path  P(N'r,M^).  Then  oy 

Lemma  4.5,  Rr[ar^J  is  transmittea  to  l\l ^  via  the  immediate 

successor  of  N'r  in  the  path  P(N'r,Ni) .  Suppose  apr=ari* 

Consider  the  strategy  obtained  by  replacing  d.-— >N  -->N  in 

x  r  p 

T(R  ,S  )  by  .  The  new  strategy  (see  Figure  4.7)  is 

p  p  x  p 

equivalent  to  T(R  ,S  )  since  R  [a  .  J  transmittea  to  Fi  by 

p  p  r  r  i  p 

Fl  --Mm  is  also  tr ansmi ttea  to  FI  via  the  transmission  path 
r  p  p 

from  lm  '  r  to  dp  (i.e.  P(d'r,d^)  joined  with  d^-->dp)  in  the 

new  strategy.  Since  the  cost  of  the  new  strategy  is  lower, 

this  contradicts  with  the  optimality  of  T ( R  ,S  ).  # 

P  P 

Figure 4.7. Illustration for Lemma 4.12.

Properties 1-6 are satisfied by all potentially optimum strategies irrespective of the size estimation method. We now give a property which uses assumption 1 (see Section 4.2). Consider two single attribute relations R_i, R_j in S, having the same joining attribute a. Let |R_i[a]| ≤ |R_j[a]|. By Property 2, there is one transmission path in T(R_r,S) containing all relations with attribute a. The following property says that each single attribute relation, except possibly the result relation, occurs exactly once and in ascending order of size, i.e. R_i and R_j are in size order if R_i is a predecessor of R_j.

Property 7: Let T(R_r,S) be a strategy fully reducing R_r over S, and a be an attribute of R_r. Then

(a) Every single attribute relation in S (except possibly R_r) occurs exactly once and in size order in T(R_r,S).

(b) Let R_r be a single attribute relation. If R_r is the largest relation among single attribute relations having attribute a then the result node N_r of T(R_r,S) is the only occurrence of R_r; otherwise, in addition to N_r, the substrategy of T(R_r,S) may contain an occurrence of R_r in size order.

Lemma 4.13: A potentially optimum strategy T(R_r,S) satisfies property 7.

Proof ;  (a)  Let  N  ^ — > . . . — .  . .  - — >N  r 
be  a  transmission  path  in  T(Rr,S).  Suppose  are 

occurrences  of  a  single  attribute  relation  R^  with  attribute 
a.  The  relations  represented  by  In,  ,  In  must  have  attribute 
a.  Then  replacing  by  results  in  a  lower 

cost  equivalent  strategy  since  R^  [ a ]  is  transmitted  to  Nm  in 
the  new  strategy  with  a  lower  cost.  Thus,  single  attribute 
relations,  except  R^ ,  can  not  occur  more  than  once  in  a 
transmission  path,  since  all  occurrences  of  a  relation  are 


in a single path (by property 2), there is exactly one occurrence of a single attribute relation (except R_r) in T(R_r,S). Suppose N_i and N_j are occurrences of single attribute relations R_i, R_j with attribute a in common such that N_i precedes N_j and |R_i[a]| > |R_j[a]|. Then interchanging N_i and N_j results in a lower cost equivalent strategy by assumption 1. This contradicts with the optimality of T(R_r,S).

(b) Let R_r be a single attribute relation in S with attribute a. Then the arguments in (a) are also applicable to all occurrences of R_r in T(R_r,S) except the result node N_r. Therefore, if R_r is not the largest single attribute relation with attribute a then T(R_r,S) may contain another occurrence of R_r in size order in addition to N_r. #

The  following  example  illustrates  Property  7. 

Example 4.9: Let Q be a simple query [13] with relations S = {R_1(a),...,R_N(a)}. Suppose |R_1[a]| ≤ |R_2[a]| ≤ ... ≤ |R_N[a]|. Let T(R_r,S) be the optimum strategy fully reducing R_r over S. In [13], it is shown that, if |R_r[a]| = |R_N[a]| then T(R_r,S) is R_1-->R_2-->...-->R_r. Otherwise,

either R_1-->...-->R_{r-1}-->R_{r+1}-->...-->R_N-->R_r (4.1)
or R_1-->...-->R_{r-1}-->R_r-->R_{r+1}-->...-->R_N-->R_r (4.2)

is the optimum strategy T(R_r,S).

If |R_r[a]| = |R_N[a]| then R_1-->R_2-->...-->R_r is the only potentially optimum strategy satisfying properties 1-7, hence it is the optimum strategy. Similarly, if |R_r[a]| < |R_N[a]| then only (4.1) and (4.2) satisfy the properties 1-7 and, hence, one of them is the optimum. #
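
The candidates singled out in Example 4.9 can be generated mechanically. The Python sketch below assumes that all sizes |R_i[a]| are distinct and that a strategy for a simple query can be written as the list of relations along its single path; the function name simple_query_candidates is hypothetical, and no cost model is applied, so the choice between (4.1) and (4.2) is left open.

```python
def simple_query_candidates(sizes, result):
    """List the potentially optimum single-path strategies for a simple query.

    sizes  : dict mapping a relation name to |R[a]| (assumed distinct)
    result : name of the result relation R_r
    Each strategy is the list of relations visited along the path.
    """
    order = sorted(sizes, key=sizes.get)          # ascending order of size
    if result == order[-1]:
        # R_r is the largest relation: it occurs once, at the end of the path.
        return [order]
    others = [r for r in order if r != result]
    return [
        others + [result],   # form (4.1): R_r occurs only as the result node
        order + [result],    # form (4.2): R_r also occurs once in size order
    ]


sizes = {"R1": 10, "R2": 20, "R3": 30, "R4": 40}
candidates_r2 = simple_query_candidates(sizes, "R2")
candidates_r4 = simple_query_candidates(sizes, "R4")
```

Whatever the size estimation method, the optimum for a simple query is then found by costing at most two paths.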

For a simple query Q with N relations, there are at most two potentially optimum strategies fully reducing a relation with respect to Q. However, for slightly more general queries, the number of potentially optimum strategies is exponential in spite of the restrictions imposed by properties 1-7, as illustrated below.

Example 4.10: Let Q be a query with a set of relations S = {R_1(a),...,R_m(a), R_{m+1}(b),...,R_{m+n}(b), R_{m+n+1}(a,b)}.

Suppose the result relation is R_r. Since R_{m+n+1} is the only relation with two attributes, in T(R_r,S), only R_{m+n+1} can have more than one immediate predecessor (by property 4(a)). Let N_I and N_II be the number of potentially optimum strategies T(R_r,S) satisfying

(i) for any occurrence N_{m+n+1} of R_{m+n+1} in the strategy indegree(N_{m+n+1}) ≤ 1,

(ii) there is at least one occurrence N_{m+n+1} of R_{m+n+1} in the strategy such that indegree(N_{m+n+1}) > 1,

respectively. Clearly the total number of potentially optimum strategies is N_I + N_II. We now show that

N  I  >  M  i  in  i  (  n  )  r  (  m  )  )  • 

In  oraer  to  f ina  Rl^,  it  is  sufficient  to  ooserve  that 
T ( R  , S )  consists  of  a  single  path  ana  eacn  single  attribute 
relation  (except  possibly  the  result  relation)  occurs 


10* 


exactly  once,  Dy  property  7.  Such  a  transmission  strategy 

can  oe  visualizea  as  having  m+n  consecutive  positions,  where 

m  of  the  positions  are  for  {R, , . . . ,R  }  and  n  of  the 

I'm 

positions  are  for  { R^+  ,  .  .  .  ,Rm+n } ;  whenever  two  consecutive 

positions  contain  two  single  attribute  relations  having 

different  attributes,  Rm+n+g  is  assumed  to  be  implicitly 

placed  in  between.  Suppose  the  positions  of  { R^  ,  .  .  .  ,Rm  j  are 

fixed  in  the  m+n  positions.  Then  there  is  no  choice  in 

placing  t Rm+  g  >  •  •  •  'Rm+n+  g )  since  they  occur  in  ascending 

order  of  size.  Thus,  RI^  is  at  least  equal  to  the  numoer  of 

ways  of  choosing  m  positions  for  R^,...,Rm  out  of  m+n 

positions,  i.e.  (m+n) .  If  R^  =  R^  tnen  the  position 

of  R  is  fixed  and  R  occurs  exactly  once,  hence  there  are 
m  m 

at  least 


-m+n-1.  _  .m+n-1. 

1  m-1  ’  (  n  ' 

ways.  Similarly,  if  Rr=^m+n  then  there  are  at  least 

m+n-1  ways  to  place  R, , . . . ,R  into  m+n-1 
m  l  m 

positions.  # 
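The counting argument of Example 4.10 can be checked numerically. The following sketch is only an illustration: the function name and the encoding of a single-path strategy as a choice of positions are assumptions made here, not part of the thesis. It enumerates every way of choosing m of the m+n positions for the relations with attribute a (the remaining positions are then forced by the size order) and compares the count with the binomial coefficient.

```python
from itertools import combinations
from math import comb


def count_single_path_orders(m: int, n: int) -> int:
    """Count orderings of m 'a' relations and n 'b' relations on one path.

    Each group must appear in ascending size order, so an ordering is
    fully determined by which of the m+n positions hold the 'a' relations.
    """
    return sum(1 for _ in combinations(range(m + n), m))


def lower_bound_n1(m: int, n: int) -> int:
    """Lower bound on N_1 from Example 4.10 when the result relation is fixed."""
    return min(comb(m + n - 1, n), comb(m + n - 1, m))


if __name__ == "__main__":
    m, n = 3, 2
    print(count_single_path_orders(m, n), comb(m + n, m))
    print(lower_bound_n1(m, n))
```

Brute-force enumeration is used deliberately here so that the closed-form count is verified rather than assumed.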


In the next section, a dynamic programming approach
will be presented to find the optimum strategy among the
potentially optimum strategies fully reducing a relation
with respect to a query Q.


4.4 Optimization

Consider a potentially optimum strategy T(R_r,S) fully
reducing a relation R_r with respect to Q. For each
potentially optimum substrategy T(R_j,S_j) of T(R_r,S), the
result relation R_j and the set S_j can be determined by
properties 1-7, where R_j ∈ S_j and S_j ⊆ S. The cost of T(R_r,S)
can be expressed in terms of the costs of its substrategies
T(R_j,S_j) and the costs of the semi-joins of the form
R_j --> R_r. The costs of the substrategies of T(R_j,S_j) can be
determined recursively in terms of the costs of their
substrategies, until substrategies of the form T(R_j,S_j) with
S_j={R_j} are obtained. Such a strategy consists of a single
node, hence its cost is zero. This section first finds the
optimum strategy fully reducing a specific relation R_r with
respect to Q recursively, as outlined above. Then the
optimum strategy fully reducing one of the relations with
respect to Q is found. This, in turn, is utilized to find
the optimum strategy fully reducing all the relations in Q.

4.4.1  Parameters  Related  to  the  Optimum  Strategy 

The optimum strategy fully reducing a relation R_r with
respect to a query Q' (which may be a subquery of the given
query Q or the query Q itself) is characterized by three
parameters:

R_r : the result relation

S : the set of relations in Q'

a_sr : the attribute common to R_r and its immediate
successor R_s. If Q' is the given query Q and R_r
is the result relation specified by Q, then
a_sr = ∅ since R_r does not have an immediate
successor in this case.

Such an optimum strategy is denoted by t(R_r,S,a_sr). The
parameter a_sr is utilized to ensure that the attribute
common to R_r and its immediate successor R_s is different
from that between R_r and its immediate predecessor (property
6(b)) when R_r is a multiattribute relation and the indegree of
R_r is 1.

4.4.2 Optimum Strategy Fully Reducing a Specified Relation


Let R_r be the result relation specified by the query, S
be the set of relations in Q, and t(R_r,S,a_sr) be the optimum
strategy fully reducing R_r over S. If R_r is a single
attribute relation in S then there are two cases: (case 1)
R_r is the largest of all single attribute relations in S
having the same attribute as R_r and (case 2) R_r is not the
largest of such relations in S. If R_r is a multiattribute
relation in S with t attributes, t>1, then there are also
two cases for R_r: (case 3) indegree(R_r)=t and (case 4)
indegree(R_r)=1, by properties 4(a) and 5(a). Hence there are
four cases for the result relation R_r in t(R_r,S,a_sr).

Consider a substrategy t(R_j,S_j,a_rj) of t(R_r,S,a_sr)
where R_j is an immediate predecessor of R_r. By property 7,
all single attribute relations (except possibly R_r) appear
exactly once and in size order in t(R_r,S,a_sr). Thus, if R_j
is a single attribute relation in S then it must be the
largest single attribute relation in S_j with that attribute
(i.e. case 1 applies to R_j in t(R_j,S_j,a_rj)). As illustrated
in Example 4.8, a multiattribute relation in S may become a
single attribute relation in S_j. Suppose the result relation
R_j of t(R_j,S_j,a_rj) is a multiattribute relation in S but it
becomes a single attribute relation in S_j. Let the attribute
common to R_j and its immediate successor R_r be a. Since R_j
becomes a single attribute relation in S_j, R_j does not have
attribute a in S_j and there is no relation in S_j having
attribute a. Suppose R_j has attribute b in S_j. If R_j is not
the largest single attribute relation having attribute b in
S_j then one cannot replace R_j with a larger single
attribute relation with attribute b in S_j since such a
relation does not have attribute a and therefore cannot be
the immediate predecessor of R_r. Hence if R_j is a
multiattribute relation in S but becomes a single attribute
relation in S_j then R_j may or may not be the largest single
attribute relation with that attribute in S_j (i.e. either
case 1 or case 2 applies). If R_j is a multiattribute
relation in S_j then either case 3 or case 4 applies to R_j in
t(R_j,S_j,a_rj).

The optimal strategy t(R_r,S,a_sr) is described below in
each of the four cases for R_r. For notational convenience,
the largest of all single attribute relations having the
same attribute a in a set S is denoted by max(a,S).
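As a small illustration of the notation max(a,S), the sketch below represents a relation by a name, the set of joining attributes it has within S, and a size. The `Relation` class and the `max_relation` function are hypothetical names introduced only for this example, and the sizes are those later listed in Figure 4.15.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Relation:
    name: str
    attributes: FrozenSet[str]  # joining attributes the relation has within S
    size: int                   # size used for the "largest" comparison


def max_relation(a: str, S: Iterable[Relation]) -> Optional[Relation]:
    """Return max(a, S): the largest single attribute relation in S whose
    only attribute is a, or None if S holds no such relation."""
    candidates = [r for r in S if r.attributes == frozenset({a})]
    return max(candidates, key=lambda r: r.size, default=None)


S = [
    Relation("R1", frozenset({"E#"}), 200),
    Relation("R2", frozenset({"E#"}), 300),
    Relation("R3", frozenset({"C#"}), 400),
    Relation("R4", frozenset({"E#", "C#"}), 600),
]

if __name__ == "__main__":
    print(max_relation("E#", S).name)
    print(max_relation("C#", S).name)
```

Note that whether a relation counts as single attribute depends on the set S under consideration, so the `attributes` field is assumed to have been computed for that particular S.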

Case 1: R_r is a single attribute relation with an
attribute, say a_i, in S and R_r = max(a_i,S) (see
Fig. 4.8).

By property 4(a), indegree(R_r)=1 in t(R_r,S,a_sr). Let R_j
be the immediate predecessor of R_r and t(R_j,S',a_rj), a_rj = a_i,
be the substrategy of t(R_r,S,a_sr). Since R_r = max(a_i,S), R_r
does not occur in any substrategy of t(R_r,S,a_sr) (by
property 7(b)). Thus, S'=S-{R_r}. Let P_S(R_r) be the set of
all possible immediate predecessors of R_r, i.e. R_j ∈ P_S(R_r).
By property 7, R_j is either the largest single attribute
relation having attribute a_i in S' or it is a multiattribute
relation with attribute a_i in S'. Thus,

P_S(R_r) = {max(a_i, S-{R_r})} ∪ {multiattribute relations
having attribute a_i in S}.

The optimal strategy t(R_r,S,a_sr) is illustrated in Figure
4.8.

The cost of t(R_r,S,a_sr), in terms of its substrategy, is

\[ \mathrm{Cost}\; t(R_r,S,a_{sr}) = \min_{R_j \in P_S(R_r)} \{ \mathrm{Cost}\; t(R_j,S',a_{rj}) + \bar{a}_{rj}(S') \} \tag{4.3} \]

where S'=S-{R_r} and ā_rj(S') is the cost of transmitting
R_j[a_rj] to R_r after R_j is fully reduced over S' by
t(R_j,S',a_rj). The value of ā_rj(S') can be estimated using
assumptions 1 and 2 as discussed in section 3.

Clearly, the substrategy t(R_j,S',a_rj) of t(R_r,S,a_sr)
cannot contain t(R_r,S,a_sr) as its substrategy since S' does
not contain R_r.

Figure 4.8. The Optimum Strategy t(R_r,S,a_sr) for Case 1.

Figure 4.9. The Optimum Strategy t(R_r,S,a_sr) for Case 2.

Figure 4.10. Substrategies of t(R_r,S,a_sr).

Case 2: R_r is a single attribute relation with an
attribute, say a_i, and R_r ≠ max(a_i,S) (see Fig.
4.9).

This case arises when R_r is the result node specified
by the given query Q or R_r is a multiattribute relation in a
higher level strategy, say t(R_s,S_s,a_ps), but it becomes a
single attribute relation in S, where t(R_r,S,a_sr) is a
substrategy of t(R_s,S_s,a_ps).

By property 4(a), indegree(R_r)=1. The result node R_j
of the substrategy t(R_j,S',a_rj) of t(R_r,S,a_sr) must be a
member of P_S(R_r), which is the same as defined in Case 1.
However, S' is either S or S-{R_r} (by property 6).

The cost of t(R_r,S,a_sr), in terms of its substrategy, is

\[ \mathrm{Cost}\; t(R_r,S,a_{sr}) = \min_{S' \in \{S,\, S-\{R_r\}\}} \Big\{ \min_{R_j \in P_S(R_r)} \{ \mathrm{Cost}\; t(R_j,S',a_{rj}) + \bar{a}_{rj}(S') \} \Big\} \tag{4.4} \]

where the parameters are as defined before.

Clearly, if S'=S-{R_r}, the substrategy t(R_j,S',a_rj)
does not contain t(R_r,S,a_sr) as its substrategy. We now show
that if S'=S then t(R_j,S',a_rj) also does not contain
t(R_r,S,a_p) for any attribute a_p. Suppose S'=S and there is a
substrategy t(R_r,S'',a_p), S''=S, of t(R_j,S',a_rj). Consider the
transmission path between the two occurrences of R_r (see
Figure 4.10). Since R_r ≠ max(a_i,S), there is at least one
single attribute relation R_k in S where R_k = max(a_i,S). By
property 7, R_k must be in the transmission path between the
two occurrences of R_r since every single attribute relation
in S appears exactly once and in size order except the
occurrence of R_r which is the result node of t(R_r,S,a_sr).
This implies that S'' is a proper subset of S since there is
no other occurrence of R_k in the substrategy with the result
node R_r (by property 7). This contradicts S''=S. Thus,
t(R_j,S',a_rj) cannot contain t(R_r,S,a_p) as its substrategy.

Case 3: R_r is a multiattribute relation in S and
indegree(R_r)>1 (see Fig. 4.11).

By Property 5, indegree(R_r) = number of attributes t of
R_r in S, and there is a unique partition S_1,...,S_t of
S-{R_r}, where S_j is the set of relations for a substrategy
of t(R_r,S,a_sr), 1≤j≤t, t>1. Let X={R_1,...,R_t} be a set of
immediate predecessors of R_r in t(R_r,S,a_sr). Each relation
R_j ∈ X has an attribute common to R_r, and (by property 2) no
two relations in X have a common attribute. For notational
convenience, let R_j ∈ X be the result node of the
substrategy over the set S_j, 1≤j≤t. By property 7, R_j is
either max(a_rj,S_j) or a multiattribute relation with
attribute a_rj in S_j. Thus given the set S and the set of
attributes of R_r in S, the valid set of immediate
predecessors X of R_r in t(R_r,S,a_sr) can be determined. Let
PX_S(R_r) be the set of all valid sets of immediate
predecessors of R_r in S. Then the cost of t(R_r,S,a_sr) in
terms of its substrategies is

\[ \mathrm{Cost}\; t(R_r,S,a_{sr}) = \min_{X \in PX_S(R_r)} \Big\{ \sum_{R_j \in X} \big( \mathrm{Cost}\; t(R_j,S_j,a_{rj}) + \bar{a}_{rj}(S_j) \big) \Big\}. \tag{4.5} \]

Clearly, no substrategy t(R_j,S_j,a_rj) can contain
t(R_r,S,a_sr) as its substrategy since S_j ⊆ S-{R_r}, 1≤j≤t.

Figure 4.11. The Optimum Strategy t(R_r,S,a_sr) for Case 3.

Figure 4.12. The Optimum Strategy t(R_r,S,a_sr) for Case 4.

Case 4: R_r is a multiattribute relation in S and
indegree(R_r)=1 (see Fig. 4.12).

By property 6(a), the substrategy of t(R_r,S,a_sr) is
also over the set S. Let R_j be the immediate predecessor of
R_r in t(R_r,S,a_sr). Since R_r has t attributes a_r1,...,a_rt in
S, t>1, R_j must be one of the relations in the set

\[ P_S(R_r) = \bigcup_{j=1}^{t} \big\{ \{\max(a_{rj},S)\} \cup \{\text{multiattribute relations in } S \text{ having attribute } a_{rj}\text{, except } R_r\} \big\}. \]

If R_r has an immediate successor, i.e. a_sr ≠ ∅, then the
immediate predecessor R_j of R_r is a relation in P_S(R_r) such
that R_j does not have attribute a_sr (by property 6(b)).
Thus, the cost of t(R_r,S,a_sr) in terms of its substrategies
is

\[ \mathrm{Cost}\; t(R_r,S,a_{sr}) = \min_{\substack{R_j \in P_S(R_r) \\ a_{rj} \neq a_{sr}}} \{ \mathrm{Cost}\; t(R_j,S,a_{rj}) + \bar{a}_{rj}(S) \} \tag{4.6} \]


We now show that t(R_j,S,a_rj) does not contain
t(R_r,S,a_sr) as a substrategy, for any attribute a_sr. It
suffices to consider the following different cases for the
immediate predecessor R_j of R_r.

(i) R_j is a single attribute relation in S. Then
t(R_j,S,a_rj) does not contain a substrategy over S
with result node R_r (from Case 1 and Case 2).
Therefore, t(R_j,S,a_rj) cannot contain t(R_r,S,a_sr).

(ii) R_j is a multiattribute relation in S and
indegree(R_j)>1. Then t(R_j,S,a_rj) does not contain
t(R_r,S,a_sr) since it does not contain any substrategy
over S (from Case 3).

(iii) R_j is a multiattribute relation in S, and
indegree(R_j)=1. Then applying the above arguments to
the immediate predecessor of R_j recursively, the only
case when t(R_j,S,a_rj) contains a substrategy over S
which may contain t(R_r,S,a_sr) is when the immediate
predecessor is a multiattribute relation in S with
indegree one. Let R_j, R_1,...,R_t be such predecessors
of R_r, where R_k, 1≤k≤t, is a multiattribute relation
in S, indegree(R_k)=1 and the substrategy with the
result relation R_k is over the set S (see Figure
4.13). The following lemma shows that each
multiattribute relation in S can occur at most once
in such a path R_r <-- R_j <-- R_1 <-- ... <-- R_t. Therefore,
t(R_j,S,a_rj) cannot contain t(R_r,S,a_sr) as its
substrategy, i.e. equation (4.6) also converges.

Lemma 4.14: Let R_r, R_j, R_1,...,R_t be multiattribute
relations in S, each of which is the result relation of a strategy
over S and has one immediate predecessor, such that
R_r <-- R_j <-- R_1 <-- ... <-- R_t is a transmission path in
t(R_r,S,a_sr) (see Fig. 4.13). Then the relations
R_r, R_j, R_1,...,R_t are different relations.

Proof: R_r ≠ R_j and R_j ≠ R_1 (by property 1). Since a_rj ≠ a_j1, R_r ≠ R_1
because otherwise R_j and R_1 have two attributes in common.
Suppose no relation occurs more than once in
R_r <-- R_j <-- R_1 <-- ... <-- R_k, 1≤k≤t-1, but R_{k+1} is the same as
one of the relations, say R_i, in the path
R_r <-- R_j <-- R_1 ... <-- R_i <-- ... <-- R_k <-- R_{k+1}, where
i ∈ {r,j,1,...,k}. Let the immediate successor of R_k in the
path be R_{k-1}, i.e. k-1=j for k=1. Consider relations R_{k-1},
R_k and R_{k+1}. Since a_{k-1,k} ≠ a_{k,k+1}, R_{k-1} and R_{k+1} cannot have
any common attribute (by Lemma 4.5). Then consider the path
R_i <-- R_{i+1} <-- ... <-- R_{k-1}. Since R_i = R_{k+1}, one of the relations
in the path must be R_k (by Lemma 4.4). But this contradicts
the fact that R_r, R_j, R_1,...,R_k are all different
relations. #

Figure 4.13. Illustration for Lemma 4.14.

In each of the four cases, the cost of t(R_r,S,a_sr) is
eventually expressed in terms of the costs of substrategies
over strictly smaller sets of relations. Consequently,
equations (4.3)-(4.6) can be used recursively until the cost
of t(R_r,S,a_sr) is expressed in terms of some constants and
the costs of substrategies of the form t(R_j,S_j,a_kj), where
S_j={R_j}. The boundary conditions are

Cost t(R_j,{R_j},a_kj) = 0

for all the relations R_j in Q and the joining attributes a_kj
in Q.
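To make the recursion concrete, the sketch below evaluates equation (4.3) with the zero-cost boundary condition for the simplest setting only: every relation is a single attribute relation on the same attribute, and the result relation is the largest one, so Case 1 applies at every level. The selectivities, the domain size, the linear cost C(X) = C0 + C1*X and the size estimate (domain size times the product of selectivities) are all assumptions chosen for illustration; they stand in for the estimation rules referred to as assumptions 1 and 2.

```python
from functools import lru_cache
from math import prod

# Hypothetical inputs: relation name -> selectivity of the shared attribute.
SELECTIVITY = {"R1": 0.2, "R2": 0.5, "R3": 0.8}
DOMAIN_SIZE = 1000   # assumed size of the attribute domain
C0, C1 = 10, 1       # assumed linear transmission cost C(X) = C0 + C1 * X


def transmit_cost(S: frozenset) -> float:
    """Assumed estimate of the cost of sending a relation fully reduced over S."""
    return C0 + C1 * DOMAIN_SIZE * prod(SELECTIVITY[r] for r in S)


@lru_cache(maxsize=None)
def cost(r: str, S: frozenset) -> float:
    """Equation (4.3) specialised to single attribute relations on one
    attribute, assuming r is the largest relation in S (Case 1)."""
    if S == frozenset({r}):
        return 0.0                          # boundary condition: single node
    rest = S - {r}                          # S' = S - {R_r}
    pred = max(rest, key=SELECTIVITY.get)   # max(a, S'): the only candidate
    return cost(pred, rest) + transmit_cost(rest)


if __name__ == "__main__":
    print(cost("R3", frozenset(SELECTIVITY)))
```

Memoising on the pair (result relation, set of relations) mirrors the way each substrategy is identified by its parameters, so a substrategy shared by several higher level strategies is evaluated once.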


The complexity of computing the optimum strategy
T(R_r,S) = t(R_r,S,∅) is illustrated for the query given in
Example 4.10. Without loss of generality, let the relations
be |R_1(a)| < ... < |R_m(a)|, |R_{m+1}(b)| < ... < |R_{m+n}(b)| and the
result node be R_{m+n+1}(a,b). Let

S_{k,j,1} = {R_1(a),...,R_k(a), R_{m+1}(b),...,R_{m+j}(b), R_{m+n+1}(a,b)},

0≤k≤m, 0≤j≤n, k+j≥1. In S_{k,j,0}, R_{m+n+1}(a,b) is absent. The
optimal strategy is t(R_{m+n+1},S_{m,n,1},∅).


The recursive algorithm can be visualized as having m.n
stages. At stage (k,j), 1≤k≤m, 1≤j≤n, the optimal strategies
over S_{k,j,1} with result nodes R_k, R_{m+j}, R_{m+n+1} are expressed
in terms of the optimal strategies at stages (k-1,j) and
(k,j-1) (see Figure 4.14), using equations (4.3)-(4.6). More
precisely, we compute the costs of strategies
t(R_k,S_{k,j,1},a), t(R_{m+j},S_{k,j,1},b), t(R_{m+n+1},S_{k,j,1},a) and
t(R_{m+n+1},S_{k,j,1},b). For example, in t(R_k,S_{k,j,1},a), since R_k
is the largest single attribute relation with attribute a in
S_{k,j,1}, 1≤k≤m, 1≤j≤n, its immediate predecessor is either
R_{k-1} or R_{m+n+1}. Using eq. (4.3) we have

\[ \mathrm{Cost}\; t(R_k,S_{k,j,1},a) = \min \{ \mathrm{Cost}\; t(R_{k-1},S_{k-1,j,1},a),\ \mathrm{Cost}\; t(R_{m+n+1},S_{k-1,j,1},a) \} + \bar{a}(S_{k-1,j,1}). \]

Figure 4.14. Illustration of the m.n Stages in Computing the
Optimal Strategy over S=S_{m,n,1}.

Since the data transferred to R_k is from a relation with
attribute a and is reduced by {R_1,...,R_{k-1}, R_{m+1},...,R_{m+j},
R_{m+n+1}}, using assumptions 1 and 2 for estimation as
illustrated in Example 4.4, ā(S_{k-1,j,1}) is given by

\[ \bar{a}(S_{k-1,j,1}) = c_0 + c_1 \cdot \{ p^a_1 \cdots p^a_{k-1} \cdot p^b_{m+1} \cdots p^b_{m+j} \cdot p_{m+n+1} \cdot |A| \} \]

where p^a_i and p^b_{m+t} are the selectivities of column a of
relation R_i and column b of relation R_{m+t} respectively,
1≤i≤k-1, 1≤t≤j.


The costs of t(R_{m+j},S_{k,j,1},b) can be computed
similarly. To compute the costs of t(R_{m+n+1},S_{k,j,1},a) and
t(R_{m+n+1},S_{k,j,1},b), for 1≤k≤m, 1≤j≤n, there are two cases.
If indegree(R_{m+n+1})=2 then using eq. (4.5),

\[ \mathrm{Cost}\; t(R_{m+n+1},S_{k,j,1},a) = \mathrm{Cost}\; t(R_{m+n+1},S_{k,j,1},b) = \mathrm{Cost}\; t(R_k,S_{k,0,0},a) + \bar{a}(S_{k,0,0}) + \mathrm{Cost}\; t(R_{m+j},S_{0,j,0},b) + \bar{b}(S_{0,j,0}). \]

If indegree(R_{m+n+1})=1 then using eq. (4.6)

\[ \mathrm{Cost}\; t(R_{m+n+1},S_{k,j,1},b) = \mathrm{Cost}\; t(R_k,S_{k,j,1},a) + \bar{a}(S_{k,j,1}). \]

Note that the immediate predecessor of R_{m+n+1} in
t(R_{m+n+1},S_{k,j,1},b) must be R_k and the set S_{k,j,1} remains
unchanged in the substrategy with result node R_k (by
property 6). The cost of the substrategy t(R_k,S_{k,j,1},a) is
expressed in terms of the costs of t(R_{k-1},S_{k-1,j,1},a) and
t(R_{m+n+1},S_{k-1,j,1},a) as given above. The cost of
t(R_{m+n+1},S_{k,j,1},a) can be computed similarly.


The costs of the optimal strategies over the sets
S_{k,0,1}, S_{0,j,1}, S_{k,0,0} and S_{0,j,0} are computed in linear
time. For example, consider t(R_{m+n+1},S_{k,0,1},b). R_{m+n+1} is a
single attribute relation in S_{k,0,1} but it is a
multiattribute relation in the next higher level strategy
since the attribute b is common to R_{m+n+1} and its immediate
successor. If |R_{m+n+1}[a]| > |R_k[a]| then by property 7,

Cost t(R_{m+n+1},S_{k,0,1},b) = Cost (R_1 --> R_2 --> ... --> R_k --> R_{m+n+1}).

If |R_{m+n+1}[a]| < |R_k[a]| then eq. (4.4) applies, resulting in

\[ \mathrm{Cost}\; t(R_{m+n+1},S_{k,0,1},b) = \min \{ \mathrm{Cost}\; t(R_k,S_{k,0,0},a) + \bar{a}(S_{k,0,0}),\ \mathrm{Cost}\; t(R_k,S_{k,0,1},a) + \bar{a}(S_{k,0,1}) \}, \]

where

Cost t(R_k,S_{k,0,0},a) = Cost (R_1 --> ... --> R_k),
Cost t(R_k,S_{k,0,1},a) = Cost (R_1 --> ... --> R_{m+n+1} --> ... --> R_k)

and R_{m+n+1}[a] appears in size order in t(R_k,S_{k,0,1},a).
The boundary conditions are

Cost t(R_1,S_{1,0,0},a) = Cost t(R_{m+1},S_{0,1,0},b) =
Cost t(R_{m+n+1},S_{0,0,1},a) = Cost t(R_{m+n+1},S_{0,0,1},b) = 0


Thus, with the exception of certain end conditions like
Cost t(R_{m+n+1},S_{k,j,1},a) with indegree(R_{m+n+1})=2, it takes
only a constant number of operations to compute the optimal
strategies at the stage (k,j) if the optimal strategies at
the stages (k-1,j) and (k,j-1) are known. Since there are
m.n stages, the total number of operations required to
compute the optimal strategy is O(m.n) while there are at
least O(C(m+n,n)) potentially optimum strategies, where
C(m+n,n) is a binomial coefficient (see
Example 4.10), for the class of queries considered above.
However, for general tree queries finding the optimal
strategy may take exponential time.
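The gap between the number of stages and the number of candidate strategies grows quickly. The short sketch below tabulates both quantities for a few values of m and n; the function names are hypothetical, and the strategy count used is the lower bound on N_1 derived in Example 4.10, so the true number of potentially optimum strategies is at least this large.

```python
from math import comb


def stages(m: int, n: int) -> int:
    """Number of (k, j) stages visited by the recursive algorithm."""
    return m * n


def strategy_lower_bound(m: int, n: int) -> int:
    """Lower bound on the number of potentially optimum strategies (Example 4.10)."""
    return min(comb(m + n - 1, n), comb(m + n - 1, m))


if __name__ == "__main__":
    for m, n in [(2, 2), (5, 5), (10, 10), (20, 20)]:
        print(m, n, stages(m, n), strategy_lower_bound(m, n))
```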


4.4.3  Optimum  Strategy  Fully  Reducing  one  of  the  Relations 

Consider a query Q such that each relation referenced
by Q has a distinct output attribute. Suppose the result
node does not contain any of the relations referenced by Q.
The optimal way to obtain the answer of the query Q at the
result node is to use an optimum sequence of semi-joins
fully reducing all the relations referenced by Q and then
send the projections over the output attributes to the
result node. Let R_i be the only relation fully reduced with
respect to Q by a strategy T(R_i,S). By Prop. 4.1, there
exists a tree query graph QG of an equivalent query such that
T(R_i,S) traverses QG from the leaves to the root R_i. Let S
have N, N>2, relations. A sequence of N-1 semi-joins, z*,
which traverses QG from the root R_i back to the leaves, if
preceded by T(R_i,S), fully reduces the remaining relations
[3]. z* has the property that for each semi-join R_j --> R_k in
z*, R_j is fully reduced with respect to Q before the
semi-join is executed. This corresponds to the minimum
amount of data transfer required for the semi-join R_j --> R_k.


Let A(S) = {a_1,...,a_n} be the set of all attributes
referenced by the qualification of Q. Clearly, each
attribute a_j in A(S) corresponds to a connected component in
JG_Q. Let m_j be the number of vertices in the connected
component of JG_Q corresponding to the attribute a_j, 1≤j≤n.
Since QG is a tree query graph of an equivalent query, there
is a spanning forest JG of JG_Q representing that
equivalent query (by Lemma 4.2). There are m_j-1 edges in the
connected component of JG corresponding to the attribute a_j.
By Lemma 4.1(a), QG contains exactly m_j-1 edges between
relations having attribute a_j in common. Let (R_i, R_k) be one
of those m_j-1 edges in QG. z* either contains R_i --> R_k or
R_k --> R_i. In either case the cost of the semi-join is ā_j(S)
since the semi-join R_i --> R_k (resp. R_k --> R_i) is executed
after R_i (resp. R_k) is fully reduced with respect to Q and Q
is a conjunction of equi-join clauses. Since the same
arguments are true for every attribute in A(S) and z*
contains exactly one semi-join for each edge in QG,

\[ \mathrm{Cost}(z^{*}) = \sum_{j=1}^{n} (m_j - 1) \cdot \bar{a}_j(S). \]
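The expression for Cost(z*) is a plain weighted sum and can be evaluated directly. In the sketch below the function and argument names are hypothetical, and the per-attribute transmission costs are made-up numbers used only to show the arithmetic; the component sizes correspond to a query in which three relations share E# and two share C#.

```python
from typing import Dict


def cost_z_star(component_sizes: Dict[str, int],
                transmit_cost: Dict[str, float]) -> float:
    """Cost(z*) = sum over attributes a_j of (m_j - 1) * cost_j(S).

    component_sizes[a] is m_j, the number of relations having attribute a;
    transmit_cost[a] is the cost of one semi-join on a between fully
    reduced relations.
    """
    return sum((m_j - 1) * transmit_cost[a]
               for a, m_j in component_sizes.items())


if __name__ == "__main__":
    sizes = {"E#": 3, "C#": 2}
    costs = {"E#": 110.0, "C#": 210.0}   # hypothetical values
    print(cost_z_star(sizes, costs))
```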


Clearly, the cost of z* is the same irrespective of which
relation in S is the first fully reduced relation. Consider
any other sequence z which is preceded by T(R_i,S) and fully
reduces the remaining N-1 relations such that R_i is
transmitted to the other relations via some semi-joins. The
semi-joins in z form a transmission path from R_i to every
other relation R_j, j≠i, 1≤i,j≤N. Then z traverses at least a
spanning tree of QG*. By Lemma 4.3(b), any spanning tree of
QG* is a tree query graph QG' of an equivalent query.
Similar to QG, for each attribute a_j in A(S), QG' contains
m_j-1 edges which are between relations having attribute a_j
in common. Consider one of such edges (R_i, R_k) in QG' and
let R_i --> R_k be the corresponding semi-join in z. Then
Cost(R_i --> R_k) ≥ ā_j(S) since R_i may not be fully reduced
before R_i[a_j] is transmitted to R_k. Since the same arguments
are applicable to every semi-join in z corresponding to the
edges of QG', Cost(z) cannot be lower than Cost(z*). That
is, z* has the minimum cost to fully reduce the remaining
N-1 relations after R_i is fully reduced by T(R_i,S) for any
R_i in S.


Let t(S) be the optimum strategy fully reducing one of
the relations in S. It is straightforward to show that for
any two single attribute relations R_i and R_j in S having
attribute a in common, Cost T(R_i,S) < Cost T(R_j,S) if
|R_i[a]| > |R_j[a]|. Thus, in order to find t(S) it is
sufficient to consider the set of all optimal strategies
T(R_k,S), where R_k is either a multiattribute relation in S
or R_k = max(a_j,S) for a_j ∈ A(S), and then choose the one with
minimum cost among those strategies. Consequently, an
optimal strategy fully reducing all the relations in S
consists of t(S) followed by z*.
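The candidate result relations for t(S) can be listed mechanically: every multiattribute relation, plus the largest single attribute relation for each attribute. The sketch below assumes a hypothetical encoding of S as a dictionary from relation name to a pair (attributes within S, size), with the sizes of the example in Section 4.4.4; it only selects the candidates and does not compute the strategies themselves.

```python
from typing import Dict, FrozenSet, List, Tuple

RelationInfo = Tuple[FrozenSet[str], int]  # (attributes within S, size)


def candidate_roots(S: Dict[str, RelationInfo]) -> List[str]:
    """Relations worth trying as the first fully reduced relation:
    all multiattribute relations and max(a, S) for each attribute a."""
    multi = [name for name, (attrs, _) in S.items() if len(attrs) > 1]
    largest: Dict[str, str] = {}
    for name, (attrs, size) in S.items():
        if len(attrs) == 1:
            (a,) = attrs
            if a not in largest or size > S[largest[a]][1]:
                largest[a] = name
    return sorted(multi + list(largest.values()))


S = {
    "R1": (frozenset({"E#"}), 200),
    "R2": (frozenset({"E#"}), 300),
    "R3": (frozenset({"C#"}), 400),
    "R4": (frozenset({"E#", "C#"}), 600),
}

if __name__ == "__main__":
    print(candidate_roots(S))
```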

4.4.4  An  Example 

In order to illustrate the procedure to find the
optimum strategy fully reducing a relation at the result
node, we consider a distributed database with four
relations,

R_1: EMPLOYEE(E#, Ename, Sex)

R_2: STUDENT-COURSE(E#, C#)

R_3: COURSE(C#, Cname, Level)

R_4: TEACHER-COURSE(E#, C#, Room).

This is the same example used in [13]. Suppose each relation
resides at a separate site and the site containing R_1 is the
result node. Consider the following query Q,

"find the employee numbers of the employees who are
students in a course and teaching an advanced level
course".

The qualification of Q is

(R_1.E# = R_2.E#) .and. (R_2.E# = R_4.E#) .and.
(R_4.C# = R_3.C#) .and. (R_3.Level = "Advanced").

The target list of the query is R_1.E#.

After local processing, the selection clause
(R_3.Level = "Advanced") is eliminated and each relation can
be projected over its attributes referenced by Q. That is,
the set S of relations after local processing is

S = {R_1(E#), R_2(E#), R_3(C#), R_4(E#,C#)}.


Let s_i denote the size of relation R_i, 1≤i≤4. Suppose
the sizes and the selectivities of the relations are as
given in Figure 4.15, where the selectivity p_iE#, i=1,2,4, is
the probability that R_i[E#] contains a given value of the
domain corresponding to E#. The selectivity p_iC#, i=3,4, is
defined similarly.


Let the cost of transmitting X amount of data from one
site to another be C(X)=10+X. For estimating the size of a
relation with more than one attribute, e.g. R_4(E#,C#), the
assumption in [13] will be used to facilitate a meaningful
comparison between the algorithm presented in Section 4.4.2
and the one presented in [13]. Thus, after the semi-join
R_3 --> R_4, where R_3 and R_4 have attribute C# in common, the
estimated size of R_4[C#] is |R_4[C#]| * p_3C# = 300 * 2/3 while
the size of R_4[E#] is unchanged.

The optimal strategy T(R_1,S) fully reducing R_1 with
respect to Q, computed using equations (4.3)-(4.6), is shown
in Figure 4.16, where


R_i    s_i    |R_i[E#]|    p_iE#    |R_i[C#]|    p_iC#

R_1    200    200          1/2      -            -
R_2    300    300          3/4      -            -
R_3    400    -            -        400          2/3
R_4    600    200          1/2      300          1/2

Figure 4.15. The Sizes and the Selectivities of the
Relations.


(a) T(R_1,S): R_3 --> R_4 --> R_2 --> R_1. Cost T(R_1,S) = 780.

(b) TG: R_2 --> R_1 (310); R_3 --> R_1 (410); R_1[E#] --> R_4 (210)
followed by R_4 --> R_1 (310). Cost TG = 1240.

Figure 4.16. The Optimal Strategy T(R_1,S) and the Strategy
TG.
Cost(R_3 --> R_4) = 10 + |R_3[C#]| = 10+400 = 410.

Cost(R_4 --> R_2) = 10 + |R_4[E#]| = 10+200 = 210.

Cost(R_2 --> R_1) = 10 + |R_2[E#]| * p_4E# = 10+300*1/2 = 160.

Thus,

Cost T(R_1,S) = 410+210+160 = 780.
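The arithmetic above can be reproduced in a few lines. The sketch below uses the cost function C(X)=10+X and the sizes and selectivities of Figure 4.15; the variable names are chosen here for illustration, and exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction


def C(x):
    """Transmission cost of x units of data: C(X) = 10 + X."""
    return 10 + x


size_R3_C = 400           # |R_3[C#]|
size_R4_E = 200           # |R_4[E#]|, unchanged by the semi-join on C#
size_R2_E = 300           # |R_2[E#]|
p_R4_E = Fraction(1, 2)   # selectivity of R_4 on E#

cost_R3_R4 = C(size_R3_C)            # R_3 --> R_4
cost_R4_R2 = C(size_R4_E)            # R_4 --> R_2
cost_R2_R1 = C(size_R2_E * p_R4_E)   # R_2 --> R_1, R_2 already reduced by R_4
total = cost_R3_R4 + cost_R4_R2 + cost_R2_R1

if __name__ == "__main__":
    print(cost_R3_R4, cost_R4_R2, cost_R2_R1, total)
```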

Since E# is the output attribute and R_1 resides at the
result node, T(R_1,S) is the optimum strategy to answer the
query.


The strategy TG found by using the algorithm given in
[13] (namely, Algorithm G) to answer Q is shown in Figure
4.16. The cost of transferring R_2 to R_1 is 310, the cost of
transferring R_3 to R_1 is 410 and the cost of transferring R_4
to R_1 is 210+10+600/2 = 520. Hence, the total cost of TG is
410+310+520 = 1240, which is significantly higher than the
cost of the optimal strategy T(R_1,S) for this example.
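For comparison, the cost of TG can be recomputed in the same way. The sketch below follows the three transfers described in the text, with the figure of 780 for T(R_1,S) taken from the text rather than recomputed; the variable names are illustrative only.

```python
def C(x):
    """Transmission cost of x units of data: C(X) = 10 + X."""
    return 10 + x


OPTIMAL_COST = 780                   # Cost T(R_1,S), as stated in the text

cost_R2_to_R1 = C(300)               # R_2 sent unreduced
cost_R3_to_R1 = C(400)               # R_3 sent unreduced
cost_R4_to_R1 = C(200) + C(600 / 2)  # R_1[E#] sent to R_4, then reduced R_4 sent
tg_total = cost_R2_to_R1 + cost_R3_to_R1 + cost_R4_to_R1

if __name__ == "__main__":
    print(tg_total, tg_total - OPTIMAL_COST)
```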


CHAPTER  5 


CONCLUSIONS 

In this thesis, two related distributed relational
database query processing problems have been studied, and
solutions are proposed. The first problem is to determine
whether or not a distributed query belongs to a special
class of queries, called tree queries. Tree queries can
always be answered by using only semi-joins which, in turn,
are usually computed with much less data transmission than
joins. Thus, it is important to be able to determine the
tree-query membership of a distributed query.

The second problem studied in this thesis is
distributed query optimization for tree queries. This is an
important special case of the general query optimization
problem for which there is no known optimum solution.

For the tree query membership problem, distributed
database queries that are conjunctions of join and selection
clauses with relational operators {>, ≥, =, ≤, <} have
been considered. A canonical representation of a distributed
query, called reduced join graph, has been introduced. The
query represented by a reduced join graph does not contain
any redundant join clauses, and is equivalent to the
original query. It has been shown that the set of equivalent
queries represented by reduced join graphs sufficiently
characterizes the set of all equivalent queries for
determining tree query membership of a given query. A
conceptually simple and efficient algorithm has been
presented which determines whether or not such a query is a
tree query. For tree queries, the presented algorithm
produces an equivalent query whose query graph is a tree,
and outputs the equivalent query together with its tree
query graph so that a sequence of semi-joins to answer the
original query can immediately be obtained. Thus, the
presented algorithm can effectively be utilized to improve
the performance of distributed query processing mechanisms.
An implementation of the algorithm has been outlined and its
time and space requirements have been established. The
extension of this work for queries involving disjunctions is
straightforward since such queries can be transformed into
disjunctive normal form and each conjunction can be treated
as a separate query. The tree query membership algorithm
could be further extended for more general queries (e.g.
queries with embedded quantifiers).

In general, there are numerous semi-join sequences,
called strategies, that can be used to answer a tree query,
and these strategies may differ significantly in the amount
of required data transmission. In order to find the optimum
strategy, first a set of properties has been given to
eliminate the strategies that can never be the optimum, then
a dynamic programming approach has been presented which
finds the optimum strategy among the potentially optimum
ones.


Distributed query optimization for tree queries
presented in this thesis uses only one size estimation
assumption, which is the uniformity and independence of
values drawn from the same domain. This assumption leads to
a reasonably good estimation for the size of the relation's
column on which the semi-join is performed. One of the
important problems to be solved in distributed query
processing is to find a good method to estimate the sizes of
other columns of a relation after one of its columns is
reduced. The estimation methods proposed in the literature
involve inaccuracies, which may significantly reduce the
effectiveness of the query optimization. A novelty of the
optimization presented in Chapter 4 is that it is
independent of the choice of the method used to estimate the
sizes of other columns of a relation when one of its columns
is reduced.


The distributed query optimization presented in this
thesis is for a class of tree queries, i.e., queries with
equi-join clauses where the number of common joining
attributes between any two relations in the query is at most
one. This work may be extended for more general tree
queries; hopefully, further generalization leads to optimal
query processing strategies for distributed queries in
general.
APPENDIX A


This appendix formalizes the transitive reduction of an
acyclic join graph that is utilized in Section 3.2 for
obtaining the reduced join graph of a query. The
construction of the transitive reduction of an acyclic join
graph is also given and its time complexity is established.

A.1. Transitive Reduction of an Acyclic Join Graph

Let G be the condensed acyclic graph for the join graph
$JG_Q$ of a query Q. Observe that, if $JG_Q$ is acyclic then G
denotes $JG_Q$ itself. Informally, the transitive reduction $G^t$
of G is a directed graph that does not have any redundant
arcs while having the same transitive closure as G, i.e.
$(G^t)^T = G^T$ [1]. The transitive reduction in our context is a
direct extension of the one in [1], for acyclic join graphs
that contain three types of arcs. The union and the
intersection operations on directed graphs are extended for
join graphs as given below.

A.1.1. Basic Definitions

Let T = {0,1,2,3} be the set of types of arcs in G where
0 denotes the type of a nonexisting arc (i.e. if there is no
arc from u to v, it is considered as a type-0 arc). We
define a binary addition operator, $\oplus$, and a binary
multiplication operator, *, over the set T with the addition
and multiplication tables as follows:

    ⊕ | 0  1  2  3          * | 0  1  2  3
    --+-----------          --+-----------
    0 | 0  1  2  3          0 | 0  0  0  0
    1 | 1  1  2  2          1 | 0  1  2  0
    2 | 2  2  2  2          2 | 0  2  2  0
    3 | 3  2  2  3          3 | 0  0  0  0



Clearly, $\oplus$ is associative and commutative, 0 is the
identity of $\oplus$, i.e. $i \oplus 0 = 0 \oplus i = i$ for all $i \in T$, and T
is closed under $\oplus$. Likewise * is associative and
commutative, T is closed under *, and * distributes over
$\oplus$. Observe that, for any two arcs $(u,v)_i$ and $(u,v)_j$ in G,
$i \oplus j$ is the type of the arc from u to v that dominates
$(u,v)_i$ and $(u,v)_j$. By convention, the sum over an empty set
of arcs is 0.
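
As a minimal sketch, the two operators can be encoded as lookup tables
in Python and some of their stated properties checked by brute force
over T. The names `ADD`, `MUL`, `is_commutative` and `is_associative`
are illustrative choices, not taken from the thesis; the table entries
are transcribed from the addition and multiplication tables above.

```python
from itertools import product

T = (0, 1, 2, 3)  # arc types; 0 stands for "no arc"

# ADD[i][j] is the type that dominates an arc of type i and one of type j.
ADD = [
    [0, 1, 2, 3],
    [1, 1, 2, 2],
    [2, 2, 2, 2],
    [3, 2, 2, 3],
]

# MUL[i][j] is the type obtained by following a type-i arc with a type-j arc.
MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]


def add(i, j):
    return ADD[i][j]


def mul(i, j):
    return MUL[i][j]


def is_commutative(op):
    return all(op(i, j) == op(j, i) for i, j in product(T, repeat=2))


def is_associative(op):
    return all(op(op(i, j), k) == op(i, op(j, k))
               for i, j, k in product(T, repeat=3))


print(is_commutative(add), is_associative(add))
print(is_commutative(mul), is_associative(mul))
```

The same tables also show that the sum of any two distinct nonzero
types is 2, a fact used in the proof of Lemma A.2.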


As discussed previously (Section 2.2), each arc $(u,v)_i$,
$1 \le i \le 3$, in G, is a type-i path of length 1 from u to v. A
sequence of arcs $(v_0,v_1), (v_1,v_2), \ldots, (v_{k-1},v_k)$ is a
path from $v_0$ to $v_k$ of length k if none of the arcs in the
path is type-3. The type of such a path is 1 if every arc
$(v_{i-1},v_i)$, $1 \le i \le k$, is a type-1 arc. Otherwise it is a type-2
path. Observe that the product of the types of arcs in a
sequence of arcs $p = (v_0,v_1), (v_1,v_2), \ldots, (v_{k-1},v_k)$ gives
the type of the path from $v_0$ to $v_k$ if the arcs in p form a
path; otherwise the product is 0, indicating that the arcs in
p do not form a path from $v_0$ to $v_k$. Since G is acyclic and
contains no self loops, the length of any path in G is
between 1 and n-1, where n is the number of vertices in G,
and the type of every path is well-defined.
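
The product rule for path types can be illustrated with a small Python
sketch. The function name `path_type` is hypothetical, the
multiplication table is the one defined in this section, and the
sequence of arc types is assumed to be non-empty.

```python
from functools import reduce

# Multiplication table over the arc types {0, 1, 2, 3}.
MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]


def path_type(arc_types):
    """Product of the arc types along a non-empty sequence of arcs.

    For two or more arcs the result is 1 or 2 when the arcs form a
    path and 0 when some arc is type-3 or missing. A single arc keeps
    its own type, since an arc is a path of length 1.
    """
    return reduce(lambda a, b: MUL[a][b], arc_types)


print(path_type([1, 1, 1]))  # every arc is type-1
print(path_type([1, 2, 1]))  # one type-2 arc makes the path type-2
print(path_type([1, 3, 2]))  # a type-3 arc breaks the path
```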
We now give the extended intersection and union
operators, denoted $\sqcap$ and $\sqcup$, respectively, on join graphs (see
Figure A.1). Let $G_1$ and $G_2$ be any two finite acyclic join
graphs* having the same vertex sets. Then

$$G_1 \sqcap G_2 = \{(u,v)_i \mid (u,v)_j \in G_1,\ (u,v)_k \in G_2 \text{ and } i = j \oplus k,\ 1 \le j,k \le 3\}$$

$$\begin{aligned}
G_1 \sqcup G_2 = {} & \{(u,v)_i \mid (u,v)_i \in G_1 \sqcap G_2\} \\
 & \cup\ \{(u,v)_i \mid (u,v)_i \in G_1 \text{ and } (u,v)_j \notin G_2 \text{ for any } 1 \le j \le 3\} \\
 & \cup\ \{(u,v)_i \mid (u,v)_i \in G_2 \text{ and } (u,v)_j \notin G_1 \text{ for any } 1 \le j \le 3\}
\end{aligned}$$

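
A small Python sketch of the two extended operators follows. It makes
a representational assumption that is not in the thesis: a join graph
is a dictionary mapping an ordered vertex pair (u, v) to the type of
the single arc from u to v. The function names are illustrative.

```python
# Addition table over the arc types {0, 1, 2, 3}.
ADD = [
    [0, 1, 2, 3],
    [1, 1, 2, 2],
    [2, 2, 2, 2],
    [3, 2, 2, 3],
]


def ext_intersection(g1, g2):
    """Arcs present in both graphs, typed by the sum of the two types."""
    return {uv: ADD[g1[uv]][g2[uv]] for uv in g1.keys() & g2.keys()}


def ext_union(g1, g2):
    """Extended intersection plus the arcs that occur in only one graph."""
    result = ext_intersection(g1, g2)
    result.update({uv: t for uv, t in g1.items() if uv not in g2})
    result.update({uv: t for uv, t in g2.items() if uv not in g1})
    return result


g1 = {("u", "v"): 1, ("v", "w"): 2}
g2 = {("u", "v"): 3, ("u", "w"): 1}
print(ext_intersection(g1, g2))
print(ext_union(g1, g2))
```

In this example the only pair with an arc in both graphs is (u, v),
and its combined type is the sum of types 1 and 3, which is 2.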

A.1.2. Transitive Reduction


For a given finite acyclic join graph G, let S(G) be
the set of all join graphs having the same transitive
closure as G, i.e. $S(G) = \{G_1 \mid G_1^T = G^T\}$. The transitive
reduction $G^t$ of G is defined as

$$G^t = \bigsqcap_{G_1 \in S(G)} G_1 .$$


Proposition A.1: For any finite acyclic join graph G, the
transitive reduction $G^t$ is unique and satisfies
(i) $(G^t)^T = G^T$ and
(ii) if $H^T = G^T$ then for any arc $(u,v)_i \in G^t$, there is an
arc $(u,v)_j \in H$, $1 \le i,j \le 3$.

* Throughout the Appendix, a graph G is represented as a set
containing all the arcs of G. The symbols $\cup$, $\cap$ and $-$ are
used to denote the usual set union, set intersection and set
subtraction operators, respectively.
[Figure A.1: The Operators $\sqcup$ and $\sqcap$.]
Proof: Follows from the following three lemmas. (Note that
Lemma A.1 and Lemmas A.2 and A.3 are actually
straightforward extensions of Lemma 1 and Lemma 2 in [1].)

In Lemmas A.1 and A.2 below, $G_1$ and $G_2$ denote any two finite
acyclic join graphs (over the same vertex set) satisfying
$G_1^T = G_2^T$.

Lemma A.1: If there is an arc $a = (u,v)_i$, $1 \le i \le 3$, in $G_1$ such
that there is no arc from u to v in $G_2$, then

$$(G_1 - \{a\})^T = G_1^T = G_2^T = (G_2 \cup \{a\})^T$$

where u and v are any two vertices in $G_1$.

Proof: Since $a \in G_1$, there is an arc $b = (u,v)_k$ in $G_1^T$,
$1 \le k \le 3$. Since $G_1^T = G_2^T$, $b \in G_2^T$. Then the type of b is
either 1 or 2 since $G_2$ does not contain any arc from u to v
and no type-3 arc can be implied by a path from u to v of
length $\ge 2$ in $G_2$.

Case 1: $b = (u,v)_1$. Then $a = (u,v)_1$ and there is no
(implied or not) type-2 path from u to v in $G_1$ or $G_2$. Since
$b \in G_2^T$ there is at least one type-1 path in $G_2$ from u to v
passing through some other vertex w. Thus $(G_2 \cup \{a\})^T = G_2^T$.
Moreover $G_1$ also contains a type-1 path from u to w
and a type-1 path from w to v. Since $G_1$ is acyclic, neither
the path from u to w nor the path from w to v can contain
$a = (u,v)_1$. Thus $G_1 - \{a\}$ also contains a type-1 path from u to
v, and $(G_1 - \{a\})^T = G_1^T = G_2^T$.

Case 2: $b = (u,v)_2$. The type of a may be 1, 2 or 3.
Since $b \in G_2^T$, there is at least one type-2 (or implied
type-2) path in $G_2$ from u to v passing through some other vertex w.
Thus $(G_2 \cup \{a\})^T = G_2^T$. With no loss of generality let the
path from u to w be type-2 (or implied type-2). Thus
$(u,w)_2 \in G_2^T$ and $(w,v)_i \in G_2^T$, $1 \le i \le 2$. Then $(u,w)_2 \in G_1^T$ and
$(w,v)_i \in G_1^T$ since $G_1^T = G_2^T$. $G_1$ contains a type-2 (or
implied type-2) path from u to w and a type-i path from w to
v. Since $G_1$ is acyclic, none of the paths from u to w and
from w to v in $G_1$ contains a. Thus $G_1 - \{a\}$ also contains a
type-2 (or implied type-2) path from u to v, and
$(G_1 - \{a\})^T = G_1^T = G_2^T$. ∎

Lemma A.2: If $(u,v)_i \in G_1$ and $(u,v)_j \in G_2$ where $1 \le i,j \le 3$ and
$G_1^T = G_2^T$ then

$$((G_1 - \{(u,v)_i\}) \cup \{(u,v)_k\})^T = G_1^T = G_2^T$$

where $k = i \oplus j$.

Proof: For k = i the result follows immediately. Since i = j
implies k = i, assume $i \ne j$ and $k \ne i$. Then k = 2 since $i \oplus j = 2$
for $i \ne j$, $1 \le i,j \le 3$. Moreover $(u,v)_i \in G_1$, $(u,v)_j \in G_2$,
$i \ne j$, and $G_1^T = G_2^T$ imply $(u,v)_2 \in G_2^T$. $G_1$ contains a
type-2 (or an implied type-2) path from u to v independent
of the type of $(u,v)_i \in G_1$. Thus the arc $(u,v)_i$ in $G_1$ cannot
contribute to a type-1 or a type-3 arc in $G_1^T$.
Therefore $((G_1 - \{(u,v)_i\}) \cup \{(u,v)_k\})^T = G_1^T = G_2^T$. ∎

Lemma A.3: Let G be a finite acyclic join graph. Then the
set $S(G) = \{G_1 \mid G_1^T = G^T\}$ is closed under $\sqcap$ and $\sqcup$.

Proof: Let $G_1$ and $G_2$ be any two members of S(G), i.e.
$G_1^T = G_2^T = G^T$. Let A, $B = \{b_1, \ldots, b_y\}$ and
$C = \{c_1, \ldots, c_z\}$ be the sets of arcs where

$$A = \{(u,v)_i \mid (u,v)_i \in G_1 \text{ and } (u,v)_j \notin G_2 \text{ for any } j,\ 1 \le i,j \le 3\}$$

$$B = \{(u,v)_i \mid (u,v)_i \in G_2 \text{ and } (u,v)_j \notin G_1 \text{ for any } j,\ 1 \le i,j \le 3\}$$

$$C = \{(u,v)_k \mid (u,v)_i \in G_1 \text{ and } (u,v)_j \in G_2,\ k = i \oplus j,\ 1 \le i,j \le 3\}$$

By definition $G_1 \sqcup G_2 = A \cup B \cup C$ and $G_1 \sqcap G_2 = C$. There is a
one-to-one correspondence between the arcs in $G_1 - A$ and the
arcs in C such that $(u,v)_i \in G_1 - A$ iff $(u,v)_k \in C$ where $k = i \oplus j$
and $(u,v)_j \in G_2 - B$.

Let $G_1 - A = \{a_1, \ldots, a_z\}$ where $a_i$ corresponds to $c_i \in C$,
$1 \le i \le z$. From Lemma A.2

$$((G_1 - \{a_1\}) \cup \{c_1\})^T = G_1^T$$

$$((((G_1 - \{a_1\}) \cup \{c_1\}) - \{a_2\}) \cup \{c_2\})^T = G_1^T$$

hence $((G_1 - \{a_1, a_2\}) \cup \{c_1, c_2\})^T = G_1^T$.

By successive applications of Lemma A.2, we have

$$((G_1 - \{a_1, \ldots, a_z\}) \cup \{c_1, \ldots, c_z\})^T = (A \cup C)^T = G_1^T = G_2^T = G^T .$$

Similarly $(B \cup C)^T = G_2^T = G_1^T$. Thus
$(A \cup C)^T = (B \cup C)^T = G^T$.

For any arc $a = (u,v)_i$ in A, $1 \le i \le 3$, there is no arc from
u to v in $B \cup C$. Then, by successive applications of Lemma
A.1 for every arc in A

$$((A \cup C) - \{a_1\})^T = (B \cup C \cup \{a_1\})^T = G^T$$

$$(((A \cup C) - \{a_1\}) - \{a_2\})^T = (B \cup C \cup \{a_1, a_2\})^T = G^T$$

$$\vdots$$

$$((A \cup C) - A)^T = C^T = (B \cup C \cup A)^T = G^T .$$

Since $G_1 \sqcup G_2 = A \cup B \cup C$ and $G_1 \sqcap G_2 = C$,

$$(G_1 \sqcap G_2)^T = (G_1 \sqcup G_2)^T = G^T .$$

Thus $G_1 \sqcap G_2 \in S(G)$ and $G_1 \sqcup G_2 \in S(G)$. ∎


Since S(G) is a finite set, Proposition A.1 follows
from Lemma A.3. (Note that Proposition A.1 is a
straightforward extension of Theorem 1 in [1] for the
transitive reduction in our context.) Thus from Proposition
A.1, the transitive reduction of an acyclic join graph G can
be obtained by successively examining each arc in the
transitive closure of G and eliminating those arcs which are
implied by transitivity.

A.2. Computing the Transitive Reduction

The transitive reduction of an acyclic join graph
containing more than one connected component simply consists
of the transitive reductions of its connected components.
Thus, it is sufficient to discuss computing the transitive
reduction of an acyclic join graph that consists of one
connected component. Let G denote such a join graph with n
vertices and A be the adjacency matrix of G, where

$$A(i,j) = \begin{cases}
1 & \text{if } (i,j)_1 \in G \\
2 & \text{if } (i,j)_2 \in G \\
3 & \text{if } (i,j)_3 \in G \text{ or } (j,i)_3 \in G \\
0 & \text{otherwise}
\end{cases}$$

for $i \ne j$, $1 \le i,j \le n$, and $A(i,i) = 0$ for $1 \le i \le n$. Note that a
type-3 arc $(i,j)_3$ in G is represented by two entries,
namely, A(i,j) = A(j,i) = 3 in A. Since G is acyclic, the
vertices in G may be renamed so that every entry below the
diagonal in the corresponding adjacency matrix A is either 0
or 3 (this conversion can easily be performed in time
bounded by $n^2$ [15]). Consider a nonzero entry in A that is
below the diagonal, i.e. A(i,j) = 3, where i > j and $1 \le i,j \le n$.
Such an entry in A does not contribute to any path in G
since the arc $(i,j)_3$ is also represented by A(j,i) = 3, and
there is no path from i to j in G for any i > j, $1 \le i,j \le n$.
Consequently the entries in A that are above the diagonal
are sufficient to represent all the arcs in G (i.e. with no
loss of generality all the entries that are below the
diagonal after the conversion can be considered as zero).
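
One possible way to carry out the renaming is a topological sort over
the directed (type-1 and type-2) arcs, ignoring type-3 entries. The
Python sketch below is an assumption on our part: it is not the method
of [15], and the function names are illustrative. Vertices are
numbered from 0 and the matrix is a list of rows.

```python
def upper_triangular_order(A):
    """Return a vertex order in which every type-1/type-2 arc goes forward."""
    n = len(A)

    def directed(i, j):
        return A[i][j] in (1, 2)

    indegree = [sum(directed(i, j) for i in range(n)) for j in range(n)]
    ready = [v for v in range(n) if indegree[v] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in range(n):
            if directed(u, v):
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    if len(order) != n:
        raise ValueError("the directed arcs contain a cycle")
    return order


def rename(A, order):
    """Adjacency matrix after renaming vertex order[k] to k."""
    return [[A[i][j] for j in order] for i in order]


# Arcs: (2,0) of type 1, (0,1) of type 2, and a type-3 arc between 1 and 2.
A = [
    [0, 2, 0, 0],
    [0, 0, 3, 0],
    [1, 3, 0, 0],
    [0, 0, 0, 0],
]
B = rename(A, upper_triangular_order(A))
print(B)
```

Each vertex is visited once and scans one matrix row, so this sketch
also runs in time proportional to the square of the number of vertices.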

In order to compute the transitive reduction of an
acyclic join graph G, first the vertices in G are renamed so
that the corresponding adjacency matrix is upper triangular,
as discussed above. The next step is to compute the
transitive reduction $G'^t$ of the resulting graph G'. Finally
the transitive reduction $G^t$ of G can be obtained from $G'^t$ by
restoring the original names of the vertices. Since the
first and the last steps are straightforward and take at
most $O(n^2)$ time, we now discuss obtaining the transitive
reduction of a join graph whose adjacency matrix is upper
triangular.

Let G be a finite acyclic join graph whose adjacency
matrix A is upper triangular. The algorithm presented below
computes the transitive reduction $G^t$ of G.

Algorithm T:

1) Find the transitive closure $G^T$ of G and let $A^T$
denote the adjacency matrix of $G^T$.

2) Compute $A_1 = A A^T$ and let $G_1$ be the graph whose
adjacency matrix is $A_1$.

3) Then $G^t$ is $G^T - G_1$.

An example illustrating Algorithm T is given in Figure A.2.

We use the following notation. AB denotes the
multiplication of two (n x n) adjacency matrices, A and B,
using the multiplication operator * and the addition
operator $\oplus$. That is, C = AB means

$$C(i,j) = (A(i,1) * B(1,j)) \oplus \cdots \oplus (A(i,n) * B(n,j)) \quad \text{for } 1 \le i,j \le n.$$

Similarly, $A \oplus B$ denotes the addition of A and B such that
$C = A \oplus B$ means $C(i,j) = A(i,j) \oplus B(i,j)$ for $1 \le i,j \le n$. Now it
remains to establish the time complexity and the correctness
of Algorithm T.
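
The three steps of Algorithm T can be sketched in Python as follows.
This is an illustration under stated assumptions, not the
implementation of the thesis: the closure is computed by summing the
matrix powers directly, which is slower than a Warshall-style
procedure, and step 3 is read as set subtraction on typed arcs
(i, j, type). All function names are illustrative, and vertices are
numbered from 0.

```python
ADD = [[0, 1, 2, 3], [1, 1, 2, 2], [2, 2, 2, 2], [3, 2, 2, 3]]
MUL = [[0, 0, 0, 0], [0, 1, 2, 0], [0, 2, 2, 0], [0, 0, 0, 0]]


def mat_mul(A, B):
    """C = AB using * for products and the addition table for sums."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc = ADD[acc][MUL[A[i][k]][B[k][j]]]
            C[i][j] = acc
    return C


def mat_add(A, B):
    return [[ADD[a][b] for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transitive_closure(A):
    """Sum of the powers A, A^2, ..., A^(n-1)."""
    closure, power = A, A
    for _ in range(2, len(A)):
        power = mat_mul(power, A)
        closure = mat_add(closure, power)
    return closure


def arcs(M):
    """Set of typed arcs (i, j, type) with a nonzero entry in M."""
    return {(i, j, t) for i, row in enumerate(M) for j, t in enumerate(row) if t}


def algorithm_t(A):
    AT = transitive_closure(A)      # step 1
    A1 = mat_mul(A, AT)             # step 2
    return AT, A1, arcs(AT) - arcs(A1)  # step 3


# Upper triangular adjacency matrix of the example in Figure A.2.
A = [
    [0, 1, 3, 0],
    [0, 0, 1, 3],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]
AT, A1, reduction = algorithm_t(A)
print(AT)
print(A1)
print(sorted(reduction))
```

For this input the computed closure and product matrices agree with
the matrices given in parts (b) and (c) of Figure A.2.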

Since G is acyclic, the adjacency matrix $A^T$ of the
transitive closure of G is given by

$$A^T = A \oplus A^2 \oplus \cdots \oplus A^{n-1}$$

where $A^k = A^{k-1} A$ for $1 < k < n$. It is clear that $A^T$ can be
computed in $O(n^3)$ time using an algorithm similar to
Warshall's Algorithm [25]. Since the complexity of step 2 is

(a) Acyclic Join Graph G and its Upper Triangular
Adjacency Matrix A.

          0 1 3 0
    A  =  0 0 1 3
          0 0 0 2
          0 0 0 0

(b) Transitive Closure $G^T$ and its Adjacency Matrix $A^T$ (step 1).

            0 1 2 2
    A^T  =  0 0 1 2
            0 0 0 2
            0 0 0 0

(c) $G_1$ and its Adjacency Matrix $A_1$ (step 2).

                    0 0 1 2
    A_1 = A A^T  =  0 0 0 2
                    0 0 0 0
                    0 0 0 0

(d) Transitive Reduction $G^t = G^T - G_1$ (step 3).

Figure A.2: An Example for Algorithm T.



proportional to $n^3$, and that of step 3 is bounded by $n^2$, the
overall complexity of Algorithm T is $O(n^3)$.


The correctness of Algorithm T follows from Lemma A.4,
which shows that $G^T - G_1$ is the transitive reduction $G^t$ of G.

Lemma A.4: Let G be an acyclic join graph whose adjacency
matrix A is upper triangular. Then $G^t = G^T - G_1$, where $G_1$ is the
graph whose adjacency matrix is $A_1 = A A^T$ and $-$ denotes set
subtraction.


Proof: By definition $A^T = A \oplus A^2 \oplus \cdots \oplus A^{n-1}$. From the
distributive property of * over $\oplus$, $A_1 = A^2 \oplus \cdots \oplus A^n$.
Clearly, an arc $a = (u,v)_i$ is in $G_1$ iff there is a path of
length $\ge 2$ from u to v in G implying the arc $(u,v)_i$ without
using the arc from u to v in G, $1 \le i \le 3$. Thus an arc $a = (u,v)_i$
is in $G^T$ but not in $G_1$ iff none of the paths of length $\ge 2$
from u to v in G implies the arc a without using the arc
from u to v in G. Since this is exactly the condition under
which an arc $a = (u,v)_i \in G^t$, we have $G^T - G_1 = G^t$. ∎


Thus, the transitive reduction $G^t$ of an acyclic join
graph having one connected component can be computed in
$O(n^3)$ time, where n is the number of vertices in G.
Consequently, the transitive reduction of an acyclic join
graph having m connected components each with $n_i$ vertices,
$1 \le i \le m$, takes at most $O(\sum_{i=1}^{m} n_i^3)$ time.


